Preface

In the fields of Linguistics, Bibliometrics and Informetrics, Zipf's law commands a high influence. There are numerous recent applications of Zipf's law in Linguistics, Internet research, Geography, Medicine and Economics. It has been called "one of the most puzzling phenomena in bibliometrics". Zipf's law approximates the relationship between the rank and the frequency of words in any text. It describes the fact that when words are ranked by frequency, from most to least frequent, plotting frequency against rank yields a hyperbolic curve.

Zipf attributed this law to the "Principle of Least Effort", which remains relevant even today. If one views Zipf's law in terms of communication costs, one infers that communication costs increase as the number of words and their length grow. Thus, Zipf's law is applicable in understanding human language. The law has been applied to many natural languages, such as Hindi, English, Urdu, Irish, Latin, Vietnamese, Chinese and Russian, as well as to the Voynich manuscript and random texts.

Zipf's goal was to put language study on a par with the exact sciences by the use of "statistical techniques". Zipf attempted to prove that the key to the explanation of all synchronic and diachronic language phenomena has been found in a statistically estimated tendency to maintain equilibrium between size and frequency. Zipf's law is perhaps the best-known model of word probabilities. Researchers have argued that Zipf's law reflects a specific property of the organization of human memory, which usually operates with the more frequent language units in all cases of the spontaneous use of speech.


The present work postulates the following hypotheses: the principle of least effort is a universal phenomenon; all writers follow an economy in the use of words irrespective of the language concerned; and the rank-frequency distribution of words is similar in all languages. Its objectives are to find the interrelationships between the rank and the frequency of a word in selected literatures, to test the applicability of Zipf's law in diverse literatures, and to compare this applicability.


The thesis is divided into the following chapters:

Chapter 1 introduces Zipf's law by defining and illustrating the mathematical foundations of Zipf's law. It highlights the major thoughts on Zipf's law held by various researchers such as Hill, Price, Mandelbrot, Herdan and Haitun. It also covers several further approaches to Zipf's law.

Chapter 2 reviews the literature on the applications of Zipf's law. It illustrates the application of Zipf's law to city populations, growth patterns of production companies, features of the Internet, finance and business, firm sizes, ecological systems, genomic data, earthquakes and clinical diagnosis. It also describes how, apart from the aforesaid languages, Zipf's law is applicable to technical subjects, random texts and monkey-type texts.

Chapter 3 discusses the research methodology involved in this work by defining the objectives, the hypotheses and the data used. It explains the reasons for choosing appropriate software and describes the packages used in this work. It also discusses the ranking methods involved in ranking the word frequencies. A note on nonlinear regression illustrates the models involved; it defines the model families and illustrates their types. A note on the Project Gutenberg e-texts and the IIT Kanpur e-text is also included.

Chapter 4 presents the analysis of the first sets of documents, viz. Computer Science Literature, English Literature, German Literature, an English-German Business Dictionary, Hindi Literature, Library Science Literature, Urdu Literature and Sanskrit Literature. It also relates Zipf's law to the Flesch Readability Index. An effort is made in this chapter to highlight the principle of least effort.

Chapter 5 discusses issues pertaining to the interrelationships between rank and frequency, issues related to the robustness of Zipf's law, and issues related to the inter-literature comparison of the applicability of Zipf's law.

Chapter 6 presents a summary, conclusions and suggestions for future
Introduction


Zipf's Law

George Kingsley Zipf was born on Jan 7, 1902 in Freeport, Ill. He graduated from Harvard in 1924, summa cum laude, and studied German studies at Bonn in 1925. In 1929, he published a dissertation on “Relative Frequency as a Determinant of Phonetic Change”. He was awarded a Ph.D. in comparative philology by Harvard in 1930. He taught German at Harvard as Assistant Professor of German in 1936 and as University Lecturer in 1939. He was interested in studying phonetic changes and thus worked on the frequency of “phonemes”. His work was more philosophical than mathematical in nature, and many other researchers tried to find a mathematical foundation for it. He died on Sep 25, 1950 at the age of only 48. Rousseau (2002) presented a short biography of Zipf and discussed his influence in the field of Informetrics and some recent applications of Zipf's law in Internet research, geography and economics.

As per Hertzel (1987), Zipf had an idea that “speech as a natural phenomenon” is really “a series of communicative gestures”, and after extensive research found that “the length of a word, far from being a random matter, is closely related to the frequency of its usage: the greater the frequency, the shorter the word”. Zipf also discovered that the “distribution of words in English approximates with remarkable precision a harmonic series... an unmistakable progression according to the inverse square, valid for well over 95% of all the different words in the sample”.

Zipf formulated a law in 1930 stating that the frequency count (number of occurrences) of a word in any text is inversely proportional to the rank of that word. In other words, the distribution of words adheres to a regular statistical pattern: “The probability of occurrence of words or other items starts high and tapers off exponentially. Thus, a few occur very often while many others occur rarely” (Black, 2000).

To further explain the basic form of the law, frequency and rank are inversely proportional:

$$\text{frequency} \times \text{rank} = \text{constant}, \quad \text{or} \quad f \times r = c, \quad \text{or} \quad \log r + \log f = \log c.$$

The frequency count of a word is the number of occurrences of that word in the text. The words are then arranged in decreasing order of frequency, so that the most frequent word gets the highest rank (rank 1). Words listed under the same dictionary entry are treated as the same word when counting. Zipf, in his first thesis, “Relative Frequency: A Determinant of Phonetic Change”, wrote, “Observing the speech of many hundreds of millions of people, we have demonstrated, in part actually, in part by induction, that the conspicuousness or intensity of any element of language is inversely proportionate to its frequency. Using X for frequency, and Y for conspicuousness (rank) we express our thesis thus: $Y = \frac{n}{X}$ or $XY = n$, where n is some constant, the actual size or value of which need not be our immediate concern now”.
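The relation $f \times r = c$ can be illustrated with a small sketch. The helper `rank_frequency` and the toy sentence below are hypothetical and only show how a rank-frequency table is built; a text this short is far too small for the product to settle near a constant.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Assumption: a word is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by decreasing frequency; ties are broken alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]

sample = "the cat and the dog and the bird"
rows = rank_frequency(sample)
for row in rows:
    print(row)
```

On a real corpus, the last column of each row is the product that Zipf's law expects to stay roughly constant across the middle ranks.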

Zipf's law approximates the relationship between rank and frequency in any text. The text should consist of at least 5000 words in order for the product $r \times f$ to be reasonably constant. Hřebíček & Luděk (2002) discussed questions related to corpora of more than 5000 words and discussed tautology in connection with Zipf's law.

Zipf attributed this law to the “Principle of Least Effort”. The Principle of Least Effort postulates that a person would like to communicate in such a way as to minimize his total effort. According to Hertzel (1987), “In simplest terms the Principle of Least Effort means, for example, that a person in solving his immediate problems will view these against the background of his probable future problems, as estimated by himself”. In other words, a person will tend to “minimize” the probable average of his work-expenditure (over time), meaning the use of the least amount of work. The Principle of Least Effort is relevant even today: if one has Internet access to resources, one is more likely to use it than the library. Altmann (2002) commented that Zipf's ideas are the foundation stones of modern quantitative linguistics and that his influence is not restricted to linguistics but incessantly penetrates other sciences. According to Tague & Nicholls (1987), “The Zipf's distribution plays a central role in the modeling of human activities, particularly of the variable studied in Bibliometrics and Scientometrics: productivity of researchers in a discipline, impact of authors or publications, use of words in a text or keys in a database and dispersion of a subject literature among sources”. However, Rapaport (1957) commented, “Zipf's arguments are vague appeals to the recognition of the principle in a great variety of situations simply on the basis of its plausibility. And even these appeals are often stretched far beyond ordinary credibility”.

Wyllys (1981) made a special study of Zipf's law and called it “one of the most puzzling phenomena in bibliometrics”. Wyllys (1981) summarized the impact of the law as, “It is remarkable in its range of applicability to diverse phenomena, but we have not progressed far in an understanding why it should exist and why it should be so widespread”. Wyllys suggests that different slopes of Zipf's curves may characterize different subject fields. Another property of Zipf's law is that the rank/frequency approximation is much better for the middle ranks than for the very highest ranks and the very lowest ranks.

Zipf (1949), in his work “Human Behavior and the Principle of Least Effort”, viewed language as a "tool" that is shaped by its "jobs" in human society. The purpose of this book, which was an introduction to human ecology, “is to establish the Principle of least effort as the primary principle that governs our entire individual and collective behaviour of all sorts”. The study introduced the idea that behaviors that are "useful" are performed frequently, and frequent behaviors become quicker and easier to perform. The very existence of these quick, easy behavior patterns then causes individuals to choose them, even when they aren't necessarily the best behavior from a functional point of view. As per Zipf, “An investigator who undertakes to propound any such primary scientific principle of human behaviour must discharge three major obligations”, that is, have a large number of verifiable observations, be consistent, and have an understandable presentation. One observation of Zipf is “the greater the prestige of a person, the ever greater will be his power of attraction both for students and for grants of research money for the employment of technicians and for the purchase of expensive apparatus, with the result that his probable opportunities for making and reporting new ‘important’ and ‘interesting’ observations will tend to increase exponentially (i.e., ‘nothing succeeds like success’)”. Other works of Zipf were “Selective Studies and the Principle of Relative Frequency in Language”, published in 1932, “Psycho-Biology of Languages”, published in 1935, and “National Unity and Disunity: The Nation as a Bio-Social Organism”, published in 1941.

In the study “Psycho-Biology of Languages”, Zipf's goal was to put language study on a par with the exact sciences by the use of “statistical techniques”. It was an attempt to prove that the key to the explanation of all synchronic and diachronic language phenomena has been found in a statistically estimated tendency to maintain equilibrium between size and frequency. As per Hertzel (1987), “Zipf recognized that there had been accurate investigative studies of language for about 100 years but nothing has ever been found in the nature of speech in any of its manifestations which is not completely comprised in the statement that speech is but a form of human behaviour”.

According to Wyllys (1981), “Zipf appears to this writer to have been poorly trained for dealing with quantitative phenomena. His knowledge of mathematics was minimal; of statistics, apparently nonexistent. He never showed interest in exploring the quantitative nature of his data beyond noting that they came close to his model of the moment. This done, he would launch into lengthy speculations about hazily defined possible causes. It is a pity that he almost never collaborated with statisticians. On the other hand, he was an indefatigable worker, and pursued the rank-frequency phenomenon and related ideas for twenty years despite often harsh criticism. There can be little doubt that the ubiquity of these phenomena would be less well recognized were it not for his work”.

Mandelbrot (1953) discussed Zipf's law in terms of communication costs and explained that communication costs increase as the number of words and their length grow. Many years after Zipf's death, linguists agreed that speakers simplify communication by using a small pool of words that they can retrieve quickly from memory, and listeners simplify communication by preferring words with a single and unambiguous meaning. This indicated that Zipf's law is applicable in understanding human language.

Definition and Mathematical Foundations of Zipf's Law

Chen and Leimkuhler (1986) stated that if one takes the words making up an extended body of text and ranks them by their number of occurrences, then the rank r multiplied by its corresponding frequency of occurrence, g(r), will be approximately constant, that is,

$$g(r) = a r^{-1}, \quad r = 1, 2, 3, \ldots,$$

where a is a positive constant.

There has been a debate as to whether Zipf's law follows a power law, a "stretched exponential" (Weibull), a "log-normal" or a "Yule" distribution.

Further analysis showed that the number of different words (N) having the same integral frequency of occurrence f (under the conditions of the equation $r \times f = c$) will be approximately inversely proportional to the square of their frequency or, stated somewhat more precisely in equation form,

$$N f^{2} = C.$$

This is Zipf's second law and has been called his “weak” law. Zipf's second law is also known as the discrete Pareto distribution (1897), which involves the count of vocabulary words ($c_f$) and their frequency. It states that

$$c_f \propto \frac{1}{f^{\phi}},$$

where $\phi$ is a positive exponent; the inverse-square form above corresponds to $\phi = 2$.

Bi et al. (2001) explained the Zipf distribution and the two Zipf “laws”: the rank-frequency one and the frequency-count one. The laws are best described with an example, such as words in a book (or the Bible, as we show in Figure 1). Let V be the vocabulary size, $f_1$ the occurrence frequency of the most frequent vocabulary word, $f_2$ that of the second most frequent, and so on.

Definition 1: The rank-frequency plot is the plot of the occurrence frequency $f_r$ versus the rank r, in logarithmic scales.

The rank-frequency version of Zipf's law states that

$$f_r \propto \frac{1}{r}.$$

This is typically referred to as Zipf's law or the Zipf distribution. In log-log scales, the Zipf distribution gives a straight line with slope -1.

The generalized Zipf distribution (or “Zipf-like” distribution) is defined as

$$f_r \propto \frac{1}{r^{\theta}},$$

where the log-log plot can be linear with any slope.
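The straight-line behaviour in log-log scales can be checked numerically. The sketch below is illustrative only: `loglog_slope` is a hypothetical helper that fits an ordinary least-squares line to log-frequency against log-rank, and the two frequency lists are synthetic (an exact inverse-rank law and a Zipf-like law with an assumed exponent of 0.8), not data from any corpus.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(f_r) against log(r) for ranks r = 1..n."""
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic rank-frequency data (assumed values, for illustration only).
ideal = [1000.0 / r for r in range(1, 101)]             # f_r proportional to 1/r
zipf_like = [1000.0 / r ** 0.8 for r in range(1, 101)]  # f_r proportional to 1/r^0.8

slope_ideal = loglog_slope(ideal)
slope_like = loglog_slope(zipf_like)
print(slope_ideal, slope_like)
```

For empirical word counts the fitted slope depends on which ranks are included, since the fit is usually poorer at the very highest and very lowest ranks.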

The second “law”, also known as the discrete Pareto distribution, involves the “count-frequency” plot: let $c_f$ be the count of vocabulary words that appear f times in the document. The second Zipf's law states that

$$c_f \propto \frac{1}{f^{\phi}},$$

where $\phi$ is a positive exponent.
Many scientists have analyzed, refined and evaluated Zipf's endeavors, but Wyllys, who has made a special study of Zipf's law, called it “One of the most puzzling phenomena in Bibliometrics” and noted that Zipf's law only approximates the relationship between rank r and frequency f for any actual corpus. Zipf's work showed that the approximation is much better for the middle ranks than for the very lowest and the very highest ranks, and his work with samples of various sizes suggests that the corpus should consist of at least 5000 words in order for the product $r \times f$ to be reasonably constant, even in the middle ranks.

Tague & Nicholls (1987) commented that, in general, Zipf's law may be described as representing the distribution of a set of tokens over a set of types. It has been represented in a number of functional forms, which may be distinguished by the number of parameters and by the nature of the property or variable described, whether a size (frequency) or a rank. Zipf's distribution resembles in structure many other distributions, such as the Yule and Bradford distributions and Lotka's law. Each of them is an empirical regularity in the study of many diverse subjects. There are four major schools of thought on the theoretical underpinning of Zipf's law. The following table demonstrates them.
Major thoughts on Zipf's Law

Person                        Direction of thought
----------------------------  ------------------------------------------------
Hill & Woodroofe, Sichel,     Zipf's law can be derived from stochastic
Crowley, Bliss                processes [Hill (1970a, b), Hill & Woodroofe
                              (1975), Sichel (1975), Crowley (1975),
                              Bliss (1953)]
  Hill                        Bose-Einstein form of the classical occupancy
                              model
  Bliss/Fisher                Negative binomial model
Simon/Price                   Many of the classical occupancy models can be
                              manipulated to yield hyperbolic distributions
                              [Simon (1960), Price (1976)]
  Simon                       Beta function
  Price                       Cumulative advantage distribution
Mandelbrot                    Information-theoretic approach to study the
                              statistical structure [Mandelbrot (1953)]
Herdan                        Works based on the field of quantitative
                              linguistics [Herdan (1964)]
Haitun, Brookes               Laplace's law of succession is shown to be the
                              'Zipfian' frequency analogue of the Bradford
                              Law [Brookes (1984) / Haitun (1982)]

Table 1.1: Researchers & their direction of thought about Zipf's law

Let us look at the works of different persons to get an insight.
Hill's Derivation

Hill (1970) derived Zipf's law from the Bose-Einstein form of the classical occupancy model with a random number of cells. It was proved that an extension of the Bose-Einstein model of allocations within regions yields convergence to a form of Zipf's law (generic-specific form). The law is described as concerning a system of classification of units such that the proportion of classes with exactly s units is, in some specified sense, approximately proportional to $s^{-(1+\alpha)}$ for some $\alpha > 0$, with $\alpha \le 1$ as a case of interest (Hill & Woodroofe, 1975). This eventually is equivalent to Zipf's first law.

The model proposed by Hill can be described as follows. Suppose there are N species which are to be distributed to M nonempty genera. Let $L_i$ be the number of species allocated to genus i, and let G(s) be the resulting number of genera with exactly s species. Suppose that the allocation of species to genera is of Bose-Einstein form,

$$\Pr\{L = l \mid M, N\} = \binom{N-1}{M-1}^{-1}$$

for all $l = (l_1, \ldots, l_M)$ such that $l_i \ge 1$, $\sum_{i=1}^{M} l_i = N$.

Suppose further that, given N, M has a conditional distribution such that $\Pr\{M/N \le x \mid N\}$ converges properly to a distribution F(x) with F(0) = 0. Then it was shown that G(s)/M, the proportion of genera with s species, is in the limit as $N \to \infty$ distributed like $\Theta(1-\Theta)^{s-1}$, where $\Theta$ denotes a random variable having distribution F. If $\Theta$ has a beta distribution B(a, b), i.e., if F has density function

$$F'(x) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1},$$

where $\Gamma$ is the Gamma function, 0 < x < 1, and a > 0, b > 0, then

$$E\{\Theta(1-\Theta)^{s-1}\} \approx a\,\Gamma(a+b)\,[\Gamma(b)]^{-1}\, s^{-(1+a)}$$

as $s \to \infty$, where the symbol $\approx$ indicates that the ratio of the two sides tends to unity. In fact this approximation is generally good even for small s. For example, if $\Theta$ has the uniform distribution on the unit interval, then

$$E\{\Theta(1-\Theta)^{s-1}\} = [s(s+1)]^{-1}.$$
This is a simple and important form of Zipf's law, fitting approximately a great variety of data. Thus they presented a conceptually simple model for an exceedingly complex phenomenon. The most fundamental assumption, underlying all the theory, is an approximate Bose-Einstein allocation of species to genera within a family.
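The beta-distribution expectation in Hill's model can be evaluated exactly and compared with its power-law approximation. The sketch below is only an illustration: the function names are hypothetical, the exact value uses the standard identity E{Θ(1-Θ)^(s-1)} = B(a+1, b+s-1) / B(a, b) for a Beta(a, b) variable, and the parameter values in the usage lines are arbitrary.

```python
import math

def expected_proportion(s, a, b):
    """Exact E{Theta * (1 - Theta)^(s-1)} for Theta ~ Beta(a, b)."""
    # B(a + 1, b + s - 1) / B(a, b), computed through log-gamma for stability.
    log_num = math.lgamma(a + 1) + math.lgamma(b + s - 1) - math.lgamma(a + b + s)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_num - log_den)

def asymptotic_proportion(s, a, b):
    """Large-s approximation a * Gamma(a + b) / Gamma(b) * s^-(1 + a)."""
    return a * math.exp(math.lgamma(a + b) - math.lgamma(b)) * s ** (-(1 + a))

# Uniform case (a = b = 1): the exact value is 1 / (s * (s + 1)).
uniform_value = expected_proportion(5, 1.0, 1.0)

# Ratio of exact to asymptotic value for a large s (arbitrary a and b).
ratio = expected_proportion(10000, 2.0, 3.0) / asymptotic_proportion(10000, 2.0, 3.0)
print(uniform_value, ratio)
```

The ratio approaching one for large s is what the statement "the ratio of the two sides tends to unity" means in practice.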

Price’s Derivation 

Price (1976) postulated that Zipf's law is based on the cumulative advantage distribution, which can be derived from a modification of the Polya urn model, or as a stochastic birth process.

Let us consider a population of $n_T$ individuals, of whom a fraction f(r) are in state r, where r is the total number of “successes” (occurrences) thus far achieved by each of the individuals in the fraction f(r) of the population, so that

$$\sum_{r} f(r) = 1$$

and the mean number of previous “successes” is

$$\sum_{r} r f(r) = R.$$

If there are further “successes”, an individual will move from state r to r+1.

Now suppose a small number $dn_T$ of new individuals are added to the population, and with them $R\,dn_T$ new successes are sprinkled evenly at random over all members. There will be $dn_T/n_T$ new successes per previous one, and for the class of $n_T f(r)$ individuals with r previous successes each, there will then be $r\,n_T f(r)\,dn_T/n_T$ new successes, and therefore $r f(r)\,dn_T$ transitions from this rth state to the (r+1)th state. Similarly, there will be $(r-1) f(r-1)\,dn_T$ transitions into it from the class below receiving its quota of new successes. The change in the number of individuals in the rth state is therefore

$$\frac{d\,[n_T f(r)]}{dn_T} = \begin{cases} (r-1) f(r-1) - r f(r), & r > 1, \\ 1 - f(1), & r = 1, \end{cases}$$

so that

$$n_T \frac{d f(r)}{dn_T} = \begin{cases} (r-1) f(r-1) - (r+1) f(r), & r > 1, \\ -2 f(1) + 1, & r = 1, \end{cases}$$

and the distribution over the states is defined by this series of difference-differential equations. For a stable distribution, for which f(r) is independent of $n_T$,

$$\frac{d f(r)}{dn_T} = 0 \;\Rightarrow\; f(1) = \frac{1}{2}, \qquad f(r) = \frac{r-1}{r+1}\, f(r-1) = \frac{r-1}{r+1} \cdot \frac{r-2}{r} \cdot \frac{r-3}{r-1} \cdots \frac{2}{4} \cdot \frac{1}{3} \cdot \frac{1}{2},$$

or

$$f(r) = \frac{1}{r(r+1)},$$

which is the form for the urn model.
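The stable solution can be checked by iterating the recurrence directly. This is a minimal sketch: `stable_distribution` is a hypothetical helper, the starting value f(1) = 1/2 is what the r = 1 equation gives when its right-hand side is set to zero, and exact rational arithmetic is used so that the comparison with 1/(r(r+1)) is not blurred by rounding.

```python
from fractions import Fraction

def stable_distribution(r_max):
    """Iterate f(r) = (r - 1) / (r + 1) * f(r - 1), starting from f(1) = 1/2."""
    f = {1: Fraction(1, 2)}
    for r in range(2, r_max + 1):
        f[r] = Fraction(r - 1, r + 1) * f[r - 1]
    return f

f = stable_distribution(50)
for r in (1, 2, 3, 10):
    print(r, f[r], Fraction(1, r * (r + 1)))
```

The partial sums of 1/(r(r+1)) telescope to 1 - 1/(r_max + 1), so the fractions f(r) add up to one in the limit, as a distribution over states should.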

Mandelbrot's Derivation 

Mandelbrot had published several studies of generalizations of Zipf's law, dealing with the question of whether the slope is -1 and with the deeper problem of explaining why the rf products should be relatively constant. Mandelbrot (1952, 1964) assumed that the aim of language is to transmit the most information per symbol with the least effort. The following relationship is obtained:

$$f(r) = K (r + c)^{-\theta},$$

where f(r) is the frequency of the word of rank r, and K, c and $\theta$ are constants; c improves the fit for small r and the exponent $\theta$ improves the fit for large r.

Mandelbrot showed that the previous equation is similar to a regular lexicographical tree. He defines a lexicographical tree as one having (N+1) trunks, numbered 0 through N, where the first trunk corresponds to a space (empty word) and each of the others corresponds to a letter. Each of the "letter" trunks has N+1 branches corresponding to the space and the N letters. The space branch is again barren, and the others branch N+1 times each, and so on. The end of each branch corresponds to a word with a given probability [f(r)]. Booth (1967) proved that Mandelbrot's derivation and Zipf's law are equivalent.
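The role of the two extra constants in Mandelbrot's form can be seen in a short numerical sketch. The function name `zipf_mandelbrot`, the choice of fixing the constant K so that the values sum to one over a finite number of ranks, and the parameter values 2.7 and 1.1 are all assumptions made for illustration.

```python
def zipf_mandelbrot(n_ranks, c=0.0, theta=1.0):
    """Relative frequencies proportional to (r + c)^(-theta) for r = 1..n_ranks."""
    weights = [(r + c) ** (-theta) for r in range(1, n_ranks + 1)]
    total = sum(weights)  # plays the role of 1/K under the normalisation assumed here
    return [w / total for w in weights]

plain = zipf_mandelbrot(1000)                        # c = 0, theta = 1: the basic law
shifted = zipf_mandelbrot(1000, c=2.7, theta=1.1)    # illustrative parameter values

print(plain[0] / plain[1])      # rank 1 is twice as frequent as rank 2
print(shifted[0] / shifted[1])  # a positive c flattens the head of the curve
```

With c = 0 and an exponent of one the function reduces to the basic inverse-rank law, while a positive c mainly changes the first few ranks.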
Herdan Waring Derivation

Herdan (1964) described language as a coding system composed of individual speech utterances and the different words found in language, while emphasising the independence of sound and meaning. Herdan then presented the following model of vocabulary frequency, whose starting point is Waring's expansion for $1/(p-q)$, i.e.,

$$\frac{1}{p-q} = \frac{1}{p} + \frac{q}{p(p+1)} + \frac{q(q+1)}{p(p+1)(p+2)} + \cdots,$$

where p > 0, q > 0.

Multiplying both sides by p - q,

$$1 = \frac{p-q}{p} + \frac{(p-q)\,q}{p(p+1)} + \frac{(p-q)\,q(q+1)}{p(p+1)(p+2)} + \cdots,$$

where the r-th term represents f(r) in the frequency distribution:

$$f(r) = \frac{(p-q)\,q\,(q+1)\cdots(q+r-2)}{p\,(p+1)\cdots(p+r-1)}$$

for r = 2, 3, ..., where p and q are such that 0 < q < p, and f(r) is the probability that a word will appear with frequency r in a large text.

One can see that the Zipf distribution can be derived from a beta function. This was suggested to provide a theoretical justification for applying his derivation to text word distributions. The model is based on the fact that the author chooses words according to the processes of imagination, association and imitation (Fedorowicz, 1982).
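A quick numerical check shows that the terms of Waring's expansion, once multiplied by p - q, behave like a probability distribution. The sketch assumes the standard form of the expansion, in which the r-th term is (p - q) q (q+1) ... (q+r-2) / [p (p+1) ... (p+r-1)] with first term (p - q)/p; the helper name and the values p = 3, q = 1 are illustrative choices, not taken from Herdan.

```python
def waring_terms(p, q, r_max):
    """Terms f(1), ..., f(r_max) of the Waring expansion multiplied by (p - q)."""
    terms = []
    term = (p - q) / p  # r = 1 term
    for r in range(1, r_max + 1):
        terms.append(term)
        # Ratio of successive terms: f(r + 1) / f(r) = (q + r - 1) / (p + r).
        term *= (q + r - 1) / (p + r)
    return terms

terms = waring_terms(3.0, 1.0, 1000)
print(terms[:3], sum(terms))
```

For p = 3 and q = 1 the terms simplify to 4 / (r (r+1) (r+2)), a long-tailed distribution whose partial sums approach one.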


Haitun / Brookes Derivation


Zipf, the professional linguist, was more interested in his own field than in statistics, but he accepted the statistical regularities found in his work. In order to do this, Zipf deviated from the frequency distributions of orthodox statistics and postulated frequency-rank distributions. In learning the vocabulary of a new language, the words of most immediate interest are those which occur most frequently in texts in that language. There is a rough and ready rule of thumb, the 80/20 rule, which states that 80% of the bibliography or of the language text is provided by the most productive 20% of the sources. Zipf adopted the unorthodox statistical technique of ranking the sources, beginning with the most frequent. The advantage of ranking is that it brings to the forefront of the distribution those items of greatest interest and relegates to the distant tail those items of rare occurrence, which are relatively difficult to find and identify, thus reversing the procedure imposed by frequency distribution.

As Zipf's law is concerned with ‘categorization’, let us see whether the Laplace law is related to it.

The number of items, and therefore the number of entities to be ranked, in the tail of the Laplace distribution from x = m to its end point at x = n+1 is given by

$$r = \frac{k}{m} - \frac{k}{n+1}.$$

As both k and (n+1) are constants, we can put $\frac{k}{n+1} = w$ and rewrite the relation as

$$\frac{k}{m} = r + w.$$

The number of items embraced by the Laplace law over this same range, m to (n+1), is given by

$$G(r) = \int_{m}^{n+1} \frac{k}{x^{2}}\, x \, dx = k \log_e (n+1) - k \log_e m = k \log_e (k/w) - k \log_e \frac{k}{r+w} = k \log_e (1 + r/w).$$

This equation is formally identical to the formulation of the Bradford law. As Zipf did not cumulate the frequency of his f/r data, the Zipf law is given by

$$g(r) = \frac{dG(r)}{dr} = \frac{k}{r + w}.$$

This is one of the forms proposed by Zipf.
The purpose of the series of derivations is to demonstrate the process by which the Zipfian approach to word distribution modelling can be represented, and also to establish a mathematical basis for its use in bibliographic database retrieval systems. Zipf's law is described in terms of the underlying processes governing the choice of text words (and, indeed, many other phenomena). These processes also direct the distribution of the contents of the inverted file of such a database system, since the inverted file is merely an alternate method of arranging the words contained in the textual material (Fedorowicz, 1982).

Some more approaches to Zipf's Law

Bi et al. (2001): Zipf's law as a special case of the Discrete Gaussian Exponential (DGX) distribution

Bi et al. (2001) presented Zipf's law as a special case of the DGX distribution. Their goal was to find a discrete distribution that fits the PDF (a.k.a. frequency-count plot) of many real data sets. There are many candidate functions to fit, such as a parabola, a third-degree polynomial, a Gaussian, a sinusoid or splines. But the question arises: even if one of these functions fits in a few cases, does one have “a priori” reasons to believe that it will fit well in multiple settings?

According to them, the answer to all these questions is the proposed DGX distribution. Judging from the success of the lognormal (also referred to as “anti-lognormal”) distribution for continuous data, they proposed the following thought experiment:

Consider a random variable, say, the duration of a web-surfing session. This is a continuous variable and most likely follows a lognormal distribution. However, we need to store it with finite accuracy, and thus turn it into an integer (number of minutes, or seconds, or hours). This is exactly the motivation behind DGX.

Consider a lognormal random variable (obtained by creating a Gaussian variable and exponentiating it); then digitize it to the nearest integer. The same is true for everything else: salaries (digitized to penny accuracy), duration of hospital stays (rounded to days), body height (inches), body weight (pounds) and so on. There is a subtle but important point: if the lognormal random variable becomes zero after the rounding, we omit it. This is necessary since, e.g., we do not know how many vocabulary words have not appeared in our document. Notice that this omission leads to the so-called “truncated” or “veiled” random variables, which are notoriously difficult with respect to their parameter estimation in the continuous case.

They proposed a discrete distribution with the following PDF:

$$P(x = k) = \frac{A(\mu, \sigma)}{k} \exp\left[-\frac{(\ln k - \mu)^{2}}{2\sigma^{2}}\right], \quad k = 1, 2, \ldots,$$

where

$$A(\mu, \sigma) = \left\{\sum_{k=1}^{\infty} \frac{1}{k} \exp\left[-\frac{(\ln k - \mu)^{2}}{2\sigma^{2}}\right]\right\}^{-1}$$

is a normalization constant depending on $\mu$ and $\sigma$. This PDF has the following characteristics:

• It is discrete, which means it is suitable to model many real discrete distributions.

• It is a discretized version of a known continuous distribution, the lognormal distribution. As we know, the PDF of a lognormal distribution is a parabola in a log-log plot, which is the next simplest model beyond a straight line.

• This model has only two parameters to estimate, so it is not difficult to compute.

Zipf's law as a special case

Lemma 1. The Discrete Gaussian Exponential (DGX) distribution, as defined by the proposed PDF, reduces to Zipf's law as $\mu \to -\infty$.

Proof: We first rewrite the proposed PDF as

$$P(x = k) \propto \frac{1}{k} \exp\left[-\frac{\ln k\,(\ln k - 2\mu)}{2\sigma^{2}}\right].$$

Assume that $\ln k \ll |\mu|$; the PDF becomes

$$P(x = k) \propto \frac{1}{k} \exp\left(\frac{\mu \ln k}{\sigma^{2}}\right) \propto k^{-\left(1 - \mu/\sigma^{2}\right)},$$

which reduces to the generalized Zipf distribution with slope $\theta = 1 - \mu/\sigma^{2}$. QED

DGX works well on real data sets both when their PDF has a clear curvature and 
when the PDF is straight in log-log plot. 
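The limiting behaviour in Lemma 1 can be explored numerically. The sketch below is an approximation under stated assumptions: `dgx_pmf` is a hypothetical helper that normalises over a finite range 1..k_max instead of the infinite sum, the probabilities are taken proportional to (1/k) exp[-(ln k - mu)^2 / (2 sigma^2)], and the values mu = -50 and sigma = 5 are arbitrary choices that make the expected exponent 1 - mu/sigma^2 equal to 3.

```python
import math

def dgx_pmf(mu, sigma, k_max):
    """DGX probabilities for k = 1..k_max, normalised over that truncated range."""
    log_w = [-math.log(k) - (math.log(k) - mu) ** 2 / (2 * sigma ** 2)
             for k in range(1, k_max + 1)]
    top = max(log_w)  # subtract the largest log-weight to avoid underflow
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    return [w / total for w in weights]

pmf = dgx_pmf(mu=-50.0, sigma=5.0, k_max=10000)

# Local log-log slope between k = 1 and k = 2.
local_slope = (math.log(pmf[1]) - math.log(pmf[0])) / math.log(2)
print(local_slope)
```

For a strongly negative mu the local slope sits close to -(1 - mu/sigma^2), while for moderate mu the log-log plot bends into the parabola typical of the lognormal case.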

Parunak, Anita (1979): Formula that relates word length to the rank-size rule

Parunak developed a graphical means of comparing sets of ordered counts (of different overall size) without assuming the form of the distribution. Such a technique, once available, can also be used to display the goodness of fit of specific analytic forms. In using the technique to compare different sets of word-frequency data, a formula was discovered that relates word length to the rank-size rule.

Suppose we have a total of ‘T’ items distributed over ‘D’ cells. The cell frequency (or size or count), $F_i$, is the number of items in cell i. Thus,

$$\sum_{i=1}^{D} F_i = T.$$

When ordered from greatest to least, the $F_i$ may be denoted by $F_{(1)} \ge F_{(2)} \ge \cdots \ge F_{(D)}$.

The approximate relation

$$r\, F_{(r)}^{\,p} \approx d, \quad r = 1, 2, \ldots, D,$$

where d and p (> 1) are constants, is called the rank-size rule or Zipf's law. Notice that if some ranks are tied (that is, if $F_{(r)} = F_{(r+a)}$ for some integer a), then the products $r F_{(r)}, (r+1) F_{(r+1)}, \ldots, (r+a) F_{(r+a)}$ cannot be equal. Also, a plot of $F_{(r)}$ against r is J-shaped and usually very long-tailed, because of the existence of many tied ranks, especially at low frequencies.

Tague & Nicholls (1987): Zipf's function describes the distribution of a set of ‘m’
tokens over a set of ‘t’ types


In its most general form, Zipf's function describes the distribution of a set of ‘m’
tokens over a set of ‘t’ types using one of the following expressions:

$$g(x) = \frac{a}{(x + c)^{b}}, \quad x = 1, 2, \dots, x_{\max}, \quad a, b > 0,\ c \ge 0$$

$$f(r) = \frac{a'}{(r + c')^{b'}}, \quad r = 1, 2, \dots, t, \quad a', b' > 0,\ c' \ge 0$$

where g(x) is the number of types with exactly x tokens and f(r) is the number of
tokens for the r-th ranking type when types are arranged in descending order of
number of tokens. The function g(x) is commonly called a size-frequency
distribution, as opposed to f(r), which is a rank-frequency distribution.

The parameter $x_{\max}$ represents the maximum number of tokens for a type, or the
maximal size or value of the productivity variable x. Note that $x_{\max} = f(1)$, that
is, the frequency of the highest-ranked type. In most cases, c is assumed to be 0,
that is

$$g(x) = \frac{a}{x^{b}}, \quad x = 1, 2, \dots, x_{\max}, \quad a, b > 0$$

In this case, the parameter a represents the number of types with exactly one
token. The larger the exponent b, the larger this number will be relative to the total
number of types.

Zipf's size-frequency distribution can be expressed as a relative frequency or
probability distribution by dividing by a suitable constant. If X represents the
number of tokens assigned to a random type, p(x) the probability that X assumes a
specific value x, and t the total number of types, then

$$p(x) = \frac{g(x)}{t} = \frac{a}{t\,x^{b}}, \quad x = 1, 2, \dots, x_{\max}, \quad a, b > 0$$

The size variable X can be generalized to a continuous productivity variable, that is, 
productivity of a type rather than number of tokens of a type. The discrete Zipf 
distribution is then replaced by its continuous analogue, the Pareto distribution. 
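
The two views of the same data, size-frequency g(x) and rank-frequency f(r), can be computed directly from a list of tokens. The sketch below uses a toy sentence and hypothetical helper names; it illustrates the definitions only and fits no parameters.

```python
from collections import Counter

def rank_frequency(tokens):
    """f(r): token counts of the types in descending order (index 0 holds rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)

def size_frequency(tokens):
    """g(x): number of types that occur exactly x times."""
    return dict(Counter(Counter(tokens).values()))

tokens = "the cat and the dog and the bird".split()   # toy data
f = rank_frequency(tokens)
g = size_frequency(tokens)

t = len(f)                               # number of types
m = sum(f)                               # number of tokens
x_max = f[0]                             # x_max = f(1)
p = {x: n / t for x, n in g.items()}     # relative size-frequency p(x) = g(x) / t
print(f, g, t, m, x_max, p)
```

Summing x * g(x) over all x recovers the token count m, and summing g(x) recovers the type count t, which is a quick consistency check on any tabulated size-frequency distribution.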

Lee Breslau et al. (1999): Zipf-like distribution with varying exponent

Consider a cache that receives a stream of requests for web pages. Let N be the total
number of web pages in the universe. Let $P_N(i)$ be the conditional probability that,
given the arrival of a page request, the arriving request is made for page i. Let all
the pages be ranked in order of their popularity, where page i is the i-th most popular
page. We assume that $P_N(i)$, defined for i = 1, 2, ..., N, has a “cut-off” Zipf-like
distribution given by

$$P_N(i) = \frac{\Omega}{i^{\alpha}}$$

where

$$\Omega = \left(\sum_{i=1}^{N} \frac{1}{i^{\alpha}}\right)^{-1}$$

The true Zipf's law has $\alpha = 1$, but one may consider a broader class of
distribution functions with exponents in the range $0 < \alpha < 1$. Each page request
is drawn independently from the Zipf-like distribution.
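
A request stream of this kind can be simulated in a few lines. This is a minimal sketch under stated assumptions: the universe size, the exponent and the random seed are arbitrary illustrative choices, and the helper name is hypothetical.

```python
import random

def zipf_like(n_pages, alpha):
    """Return [P_N(1), ..., P_N(N)] with P_N(i) = omega / i**alpha."""
    weights = [1.0 / i ** alpha for i in range(1, n_pages + 1)]
    omega = 1.0 / sum(weights)          # normalising constant
    return [omega * w for w in weights]

n_pages, alpha = 1000, 0.8              # illustrative values, 0 < alpha < 1
probs = zipf_like(n_pages, alpha)

# Draw independent page requests; page i is the i-th most popular page.
rng = random.Random(0)
requests = rng.choices(range(1, n_pages + 1), weights=probs, k=10_000)
print(probs[0], probs[1], requests[:10])
```

A smaller exponent flattens the distribution, so a cache of fixed size captures a smaller share of the requests than it would under the true Zipf's law.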


Review of Literature


There have been many more attempts to show that Zipf's law is actually a power law, a
“stretched exponential” (Weibull), a “log-normal” or a “Yule” distribution. To
mention a few: the Yule distribution (Martindale et al., 1996), the log-normal distribution
(Perline, 1996), the stretched exponential distribution (Laherrere et al., 1998) and the
double-Pareto lognormal distribution (Reed, 2001, 2002, 2003). Bi et al. (2001)
proposed an alternative distribution called DGX, which includes the Zipf and generalized
Zipf distributions as special cases. They commented, “...the Zipf distribution often
fails to model real data sets well... the Zipf (or generalized-Zipf) distribution would
expect the plots to be straight lines in logarithmic-logarithmic scales. However, we
observe a clear tilting. Zipf himself had observed this deviation and even had a
name for it (“top concavity”), and he devoted several paragraphs in his book to
justify it, whenever it appeared in a data set”.

Laherrere et al. (1998) commented that “Power laws are generally used to represent
natural distributions, often claimed to be power laws which represent as linear
regressions in log-log plots. In reality however, the plots often display linearity over
a limited range of scales and/or exhibit noticeable curvature”. They found that
stretched exponential distributions provide a reasonable fit to all the data sets and have
the advantage of a sound theoretical foundation. Stretched exponentials also have
the advantage of being economical in their number of adjustable parameters.

Mandelbrot (1959) criticized Simon's model (1955) concerning the class of
frequency distributions generally associated with the name of G.K. Zipf. He
commented that Simon's model is analytically circular in some cases. Simon
(1960) refuted this and commented that the basic parameter of the distributions is
almost always very close to unity and hence simple stochastic models can be
constructed. Mandelbrot (1961) maintained his objections to Simon's 1955 model
for the Pareto-Yule-Zipf distribution.

As cited in Hertzel (1987), Wyllys (1975) summarized: “Inclined towards
mysticism, Zipf not only leaped to the conclusion that the ‘true’ slope of
rank-frequency curves was -1, but also claimed that this regular slope resulted from some
fundamental force of nature. In the broad sense, this claim had to be correct; but
Zipf vigorously described the force as that of struggle between the ‘Life Tendency’
and the ‘Death Tendency’ or the ‘Force of Diversification’ and the ‘Force of
Unification’ and finally as the ‘Principle of Least Effort’ for none of which did he
furnish an operable definition. However, in work summarized... Zipf did show that
an astonishingly wide range of phenomena... exhibited distributional behaviour that
could be approximated by his ‘Law’”. As per Haitun (1982), “Zipf's law applies to
the distribution of many social characteristics, and that this law implies that social
phenomena are inherently non-gaussian”. Chai Kim (1982) investigated the extent
to which the principle of least effort as advanced by Zipf provided a theoretical
basis for identifying and updating descriptors of science/technology and social
sciences. He found that “the relative frequency of occurrence of the descriptors of
social sciences conformed to the theoretical distribution of Zipf while that of
science/technology did not”.

According to Thom & Zobel (1992), “Zipf's law is perhaps the best known model
of word probabilities. It describes the fact that when words are ranked on frequency,
from most to least frequent, plotting rank against frequency yields a hyperbolic
curve”. They argued that too much emphasis has been placed on this result
(Zipf's law).

Witten & Bell (1990) found that even words produced by a simple random
generator conform to Zipf's law. According to them, “Although theoretically
elegant, Zipf's law provides only a loose fit to actual text and in practice must be
modified by introduction of additional parameters”.

The Russian statistician S.D. Haitun published a three-part comprehensive review
of all the empirical frequency distributions that have been reported in the literature
of bibliometrics and related fields. He postulated that all the empirical distributions
can be divided into two types: the Gaussian type (G-type) and the Zipfian type
(Z-type). According to him, G-type distributions are characterized by
the fact that they have as many higher moments as modern statistical theory
demands. Z-type distributions have no moments whatever. According to Brookes
(1984), “Gaussian-type distributions arise only in physical contexts; Zipfian only in
social contexts. As the whole of modern statistical theory is based on Gaussian
distributions, Haitun thus shows that its application to social statistics, including
cognitive statistics, is ‘inadmissible’. A new theory based on Zipfian distributions is
therefore needed for the social sciences”.

Ivancheva (2001) attempted to answer the question, “Why do most bibliometrics
and scientometric laws reveal characters of Non-Gaussian distributions, i.e., have
unduly long tails?” Ivancheva postulated a corollary that, for a discrete Pareto
distributed random variable, α = 1 is the most reasonable value for the family of Zipf
laws applied to information or social phenomena. Nicholls (1987) applied many
methods for the estimation of Zipf parameters. These included linear least squares
(LLS), maximum likelihood (MLE), ratio of frequencies (RAT), minimum
chi-square (MIN), method of moments (MOM) and a truncated least squares method.
Apostolos & Li (1997) proposed a novel metric for evaluating the goodness of fit
between the distribution functions of two samples. They extended the
usage of the proposed criterion to the case of the generalized Zipf distribution.
According to them, “Since the Zipf distribution of a document employs the
frequencies of the words forming that particular document; it is justified to evaluate
the contextual similarity based on the numerical encoding produced by the
particular distribution”. Egghe (1999) studied the probabilities of the occurrence
of multi-word (m-word) phrases (m = 2, 3, ...) in relation to the probabilities of
occurrence of the single words. Egghe found that in the latter case, the law of Zipf is
valid.

Applications of Zipf's Law

Many researchers have applied Zipf's law to city populations. Hill (1970) applied
Zipf's law to the composition of a population. He found that, as the population size
becomes large, the limiting distribution of frequencies takes a
weak form of Zipf's law. Makse et al. (1995) modeled urban growth patterns
using Zipf's law. Krugman (1996) used Zipf's law for the self-organizing
economy. Zanette et al. (1997) developed a model of large-scale city formation.
Manrubia et al. (1998) developed an intermittency model for urban development.
Marsili & Zhang (1998) modeled interacting individuals and commented, “In
many disparate societies, it is not unnatural to assume that individuals make their
city-dwelling decision based on their own opinions as well as on their interaction
with other citizens”. They found that the larger cities approximately obey Zipf's law.
Gabaix (1999) gave an explanation for Zipf's law for cities. Reed (2002)
analyzed the rank-size distribution for human settlements on the basis of simple
stochastic models and found that the model explains the rank-size phenomenon in the
upper tail. According to Reed (2001), “It has long been recognized that the
distribution of size (human population) of cities within a particular country or
jurisdiction frequently exhibits Paretian behaviour in the upper tail. This
phenomenon is known as the rank size property or in the case when the Pareto
exponent is unity as Zipf's law. There have been many attempts to explain this
phenomenon”. Knudsen (2001) tested for the existence of Zipf's law on 14,541 Danish
production companies and addressed three basic questions: does the Danish case
refute Zipf's law for cities, what are the implications of Zipf's law for models of
local growth, and is there a Zipf's law for firms? Based on empirical data, Knudsen
found that the growth pattern of Danish production companies follows a clean
rank-size distribution consistent with Zipf's law. Marsili and Zhang (1998) presented a
general approach to explain the Zipf's law of city distribution. They commented, “If
the simplest interaction (pair wise) is assumed, individuals tend to form cities in
agreement with the well-known Statistics”. According to them, the interaction
leading to Zipf's law is, on the one hand, the simplest possible (pairwise interaction).
On the other hand, it is a rather special one, since it is the “lowest order” of interaction
that does not lead to the formation of a mega city drawing a good portion of
the whole population. Urzua (2000) presented a simple and locally optimal test for
Zipf's law and illustrated its use in the case of the largest US metropolitan areas. He
commented, “the log of the Zipf variate x/μ follows an exponential distribution with
mean equal to one, while its inverse follows a uniform with mean equal to one-half”.

Many scientists have attempted to examine the informetric properties of the web in
the past. Adamic & Huberman (2002) claimed that Zipf's law governs many
features of the Internet and has implications for its design and function.
The connectivity of Internet routers influences the robustness of the
network, while the distribution of the number of email contacts affects the spread of
email viruses. Even web caching strategies are formulated to account for a Zipf
distribution in the number of requests for web pages. According to Adamic &
Huberman (2002), the Internet is comprised of networks on many levels, and some
of the most exciting consequences of Zipf's law have been discovered in this area.
The distribution of the number of computers to which a given computer is connected is a
Zipf distribution. The presence of Zipf's law has implications for the search
strategies used in P2P networks. Knowledge of Zipf's law in the connectivity
distribution has offered a solution to an Internet communication problem. According
to Shi et al. (2006), “Zipf's law (Zipf-like law) holds the promise of more
effective design and use of Web cache resources. Ongoing work includes the
application of the work studied in this paper and the study of the Web prefetching
model based on the Zipf's law”. Rousseau (2001) analyzed a time series
of the number of hits for the word “Euro” on the web over a period of one year. Lee
Breslau et al. (1999) raised the question of whether web requests from a fixed user
community are distributed according to Zipf's law. Using traces from a variety of
sources, they found that the page request distribution seen by web proxy caches does
not follow Zipf's distribution precisely, but instead follows a Zipf-like distribution
with varying exponents. Chao & D'haeseleer (2001) attempted to find the
distribution of variable-length phatic interjectives on the World Wide Web. They
found that the number of pages containing these words falls off as a
power law. However, the exponents for the length-frequency distributions of different
interjectives were much larger than the -1 predicted by Zipf's law. There are many
other instances of Zipf's law in web access statistics and Internet traffic, such as a
caching relay for the World Wide Web (Glassman, 1994), Internet web servers
(Arlitt et al., 1997), World Wide Web traffic (Crovella et al., 1997), power laws
in designed systems such as Internet traffic (Carlson, 2000) and the nature of markets in
the World Wide Web (Adamic et al., 2000). According to Chen & Wu (1997),
“Many models have been developed to predict a software system's failure rate and
were used as management tools to evaluate software reliability. Software failure
processes can be modeled by non-homogeneous Poisson process (NHPP), which
was originally used to analyze the hardware failure data. The Duane model is a
well-known NHPP model based on power law failure rate for analyzing hardware
reliability”. They proposed a model based on Zipf's law for software
reliability analysis and observed that the proposed model has better long-term
predictability than the Duane model for failure data sets with power-law failure
rates.

Zipf's law also has applications in finance and business. If the distribution is
plotted not as a rank-frequency plot but as the number of companies at each level of
revenue or sales, Zipf's law is still observed. Champernowne (1953) presented a model of
income distribution. Mandelbrot (1963) discussed new methods in statistical economics,
a theme he returned to in his book Fractals and Scaling in Finance: Discontinuity,
Concentration, Risk (Mandelbrot, 1997). Aoyama et al. (2000) applied Pareto's law to the
income of individuals and the debt of bankrupt companies. Reed (2001) commented that
many empirical size distributions in economics and elsewhere exhibit power-law
behavior in the upper tail. The prime examples are the distribution of incomes (Pareto's
law) and city sizes (Zipf's law, or the rank-size property). Stanley et al. (1995) related
Zipf plots and the size distribution of firms. Axtell (2001) also showed that the
Zipf distribution characterizes firm sizes; the probability that a firm is larger than size s
is inversely proportional to s. According to him, the Zipf distribution is an
unambiguous target that any empirically accurate theory of the firm must hit. This
result places important limits on models of firm dynamics. Because the Zipf
distribution obtains all the way down to the smallest sizes, it should be possible to
derive Kesten-type processes and, hence, the Zipf distribution from a
microeconomic model in which individual agents interact to form productive teams.
Bence & Oppenheim (2004) tested a group of 1489 titles for a Bradford-Zipf
distribution and asked whether Bradford-Zipf applies to business and
management journals in the 2001 Research Assessment Exercise. According to
them, “Zipf's Law describes the frequency distribution of words in a given text,
with familiar words being used many times and many words being used only once.
Bradford's and Zipf's laws have been shown to be mathematically identical and so
the distribution is often referred to as the Bradford-Zipf distribution”. Choi et al.
(2005) investigated the rank distribution and the cumulative probability for stock
prices, and the probability density of price returns for stocks traded on the Korean
Stock Exchange (KSE) and the Korean Securities Dealers Automated Quotations
(KOSDAQ) market. According to Choi et al. (2005), “the ranks for stock prices
traded on the KSE, the KOSDAQ, and the TSE follow Zipf's law or a power law
while that of the NYSE follows a power law”. As per Samuelsson (1996), Zipf's
Law is also closely related to the Good-Turing smoothing technique, and a better
law could lead to better smoothing. He showed that Zipf's Law implies a smoothing
function slightly different from Good-Turing.




Zipf's law is applied in many other areas, such as ecological systems, genomic data,
earthquakes and clinical diagnosis. Hill (1974) proposed a modification
involving classification of species into a family and then into genera within families.
Reed et al. (2002, 2003, 2004) presented models for the size distribution of
forest fires, the distribution of family names and the size distribution of gene and protein
families. The law was applied to the distribution of large earthquakes (Sornette et al.,
1996). Li et al. (2002) applied Zipf's law to the importance of genes for cancer
classification using microarray data. Tachimori et al. (2002) analyzed the
frequency of clinical diagnoses and found that, in both group types, an inverse-power
relationship exists between the rank order of diagnoses and the frequency of the
appearance of these diagnoses. (This relationship is called Zipf's law, which is
observed in natural language.) They found that, “in addition to the clinical diagnoses,
medical indices such as average length of hospital stay, frequencies of medical
treatments expressed in terms of ICD-9-CM (International Classification of Disease
9th Revision, Clinical Modification) and medical fees, also follow Zipf's law”. They
showed that the diagnostic sets based on the doctor's diagnoses followed Zipf's law. They further
commented, “The indication that diagnostic sets observe Zipf's law may possibly
have major effects on changing the conventional concept of diagnostic frequency
rate”. There are many more examples, such as Zipf's law in percolation (Watanabe,
1996), in the immune system (Burgos et al., 1996), in the liquid-gas phase transition of
nuclei (Ma, 1999) and in a psychiatric ward (Piqueira, 1999).

There have been many applications of the law to natural languages, such as English
(Miller et al., 1958) and Chinese (Rousseau et al., 1992), and to the Voynich manuscript
(Landini, 1997). However, there are few applications of the law to random
texts. Li (1992) showed that Zipf's law is applicable to random texts provided
the text has a word structure and length distribution very different from those of a
natural language. Losee (2001) provided an information-theoretic interpretation of
Zipf's Law, a power law. Using the regularity noted, he suggested that Zipf's Law is a
consequence of the statistical dependencies that exist between terms, described
using information-theoretic concepts. He found relationships between the
frequency-based characteristics of neighboring terms in natural language and the
rank or frequency of the terms. Given the term rank or frequency, he drew inferences about
the entropy, or average information, of a term or a group of terms. The amount of
information that one term has about another depends on the rank of one of the terms
and on the rank or frequency of the term pair. Using these relationships, he offered a
partial explanation of why Zipf's Law occurs as it does. According to
Ferrer-i-Cancho & Sole (2002), “random texts lose the Zipfian shape in the frequency
versus rank plot when words are restricted to a certain length, which is not the case
in real texts. It is thus clear that monkey languages' partial validity relies on their
word length distribution, which we have indicated is unrealistic. These results
suggest that future theories of language origin should be able to explain the origin of
Zipf's law, instead of using it as a given constraint”.

Zipf's law in literature

Zipf's law postulates that the frequency of occurrence of any word as a function of
rank follows a power law with an exponent close to unity. It has been applied to many
areas, such as natural languages, monkey-typing texts, web-access statistics,
informetrics, finance and business, and ecological systems. Opinions differ
on whether the power law embedded in Zipf's law is actually a Yule
distribution (Martindale et al., 1996), a lognormal distribution (Perline, 1996) or a
stretched exponential distribution (Laherrere et al., 1998).

Ferrer-i-Cancho & Sole (2001) commented that Zipf's law has been a popular
achievement of quantitative linguistics. Zipf's law appears to be robust. Many models of
syntactic communication assume this law. It is an obvious ingredient for any theory
of language evolution. A complete theory of language requires a theoretical
understanding of its implicit statistical regularities. According to them, “Words in
human language interact in sentences in non-random ways, and allow humans to
construct an astronomic variety of sentences from a limited number of discrete
units. This construction process is extremely fast and robust. The co-occurrences of
word in sentences reflect language organization in a subtle manner that can be
described in terms of a graph of word interactions”.

According to Ferrer-i-Cancho (2005), “Given the apparent universality of Zipf's
law and also the enormous differences between all languages on Earth, it is
tempting to think that its explanation has nothing to do with language.... Zipf's law
for word frequencies could be the manifestation of a complex system operating
between order and disorder”. According to Miller & Chomsky (1963), “The
occurrence of Zipf's law does not constitute evidence of some powerful and
universal psychological force that shapes all human communication in a single
mould”. Zipf's law does not manifest at a higher level of semantic cognition, where
language appears compressed. Therefore, Zipf's law can be rooted in a language
structuring process of coding, which adds the redundancy necessary for language
understanding. Zipf's Law provides a baseline model for the expected occurrence of
target terms, and the answers to certain questions may provide considerable
information about its role in the corpus (Steele et al., 1998). Zipf's Law provides a
distributional foundation for models of the language learner's exposure to segments,
words and constructs, and permits evaluation of learning models (Brent,
1997). According to Powers (1998), “Zipf's theory requires effort to be constant
independent of frequency, however Information Theory and Psychological
experiments both indicate that this ought not to be the case, and that it in fact
decreases in a way consistent with an optimal strategy for an unbounded lexicon”.

According to Powers (1998), “Zipf considered that the speaker had to build a
continuous stream of specified products, that is an ongoing stream of utterances
conveying specified meanings, in such a way as to minimize his effort as speaker
consistent with effective communication to the hearer, her task being simplified as
the relationship between utterances and meanings approached one to one: the work
involved in producing a construction consists of the work involved in fetching the
tool, which is directly in proportion to the cost of fetching the tool and includes both
the mass of the tool, m, and the distance, d, that it needs to be fetched, given
increasing either increases the effort required”. According to Ferrer-i-Cancho &
Sole (2003), “the early hypothesis of Zipf of a principle of least effort for
explaining the law is shown to be sound. Simultaneous minimization in the effort of
both hearer and speaker is formalized with a simple optimization process operating
on a binary matrix of signal-object associations. Zipf's law is found in the transition
between referentially useless systems and indexical reference systems. We strongly
suggest that Zipf's law is a hallmark of symbolic reference and not a meaningless
feature”.

According to Le Quan Ha et al. (2002), “Zipf discovered the law by analyzing manually
the frequencies of words in the novel Ulysses by James Joyce. It contains a
vocabulary of 29,899 different word types associated with 260,430 word tokens”.
They found that, for single words, Zipf's law is valid only for high-frequency words.
Zipf's law performs better on data containing single words and n-gram phrases
combined together. It also works for low frequencies, across languages.
According to Li (2002), one of the many phenomena exhibiting a Zipf's law pattern is
word usage in human languages. The number of times a word is used in written
human languages and the frequency of usage are the variables that follow a
Zipf-type distribution. This phenomenon can also be extended to spoken
languages, non-English or non-Latin languages, combinations of words, etc.

Smith & Devine (1985) found that legal texts also follow Zipf's law, but in a slightly
different manner. They showed that lawyers use more words than other people.
Francis & Kucera (1964) applied Zipf's law to the Brown corpus of 1 million
words of American English. A corpus is a body of naturally occurring text, stored in
a machine-readable form. Le Quan Ha et al. (2002) analyzed Zipf's law for large
corpora in two languages, English and Mandarin. The English corpora used in their
experiments were taken from the Wall Street Journal, and the Mandarin corpus
was the TREC Corpus, obtained through the Linguistic Data Consortium from the
People's Daily newspaper (01/1991 to 12/1993) and the Xinhua News Agency
(04/1994 to 09/1995). Wang (1989) presented
Zipf's distribution of a Chinese corpus and Wyllys (1981) took a data set of 3907
English words. Sun et al. (1999) proposed a simpler model for estimating the
frequency of any same-frequency words and identifying the boundary point between
high-frequency words and low-frequency words in a text. The model was based on
the maximum ranking method; it ranked words and estimated word frequency
with the help of a formula. They commented, “Studies of word frequency have
many interesting and potentially significant applications. For example this model
could be used to evaluate a single article or an author's work. Assuming a
reasonable level of skill among the writers whose works are the basis for our
observations, we can use this model as a benchmark for assessing writer's language
skills”. According to Pinker & Bloom (1990), “Many authors have pointed out
that tradeoffs of utility concerning hearer and speaker needs to appear at many
levels. As for the phonological level, speakers want to minimize articulatory effort
and hence encourage brevity and phonological reduction. Hearers want to minimize
the effort of understanding and hence desire explicitness and clarity”.

Ferrer-i-Cancho & Sole (2003) commented that the effort for the hearer has to do
with determining what the word actually means. The higher the ambiguity (i.e. the
number of meanings) of a word, the higher the effort for the hearer. Besides, the
speaker will tend to choose the most frequent words. The availability of a word is
positively correlated with its frequency. Gernsbacher (1994) called this
phenomenon the word-frequency effect. Because the most frequent words tend to be
the most ambiguous, the speaker therefore tends to choose
the most ambiguous words, which is opposed to the least effort for the hearer. Zipf
referred to the lexical tradeoff as the principle of least effort. He pointed out that it
could explain the pattern of word frequencies, but he did not give a rigorous proof
of its validity. Word frequencies obey Zipf's law. If the words of a sample text are
ordered by decreasing frequency, the frequency of the k-th word, P(k), is given by

$P(k) \propto k^{-\alpha}$, with $\alpha \approx 1$. According to Balasubramaniyan & Narayan (1996)
this pattern is robust and widespread.
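
The exponent α can be estimated from a sample text by fitting a straight line to log frequency against log rank. The sketch below uses ordinary least squares, which is only one of several possible estimators and is not claimed to be the method of any author cited here; the synthetic word list and the function name are assumptions made for illustration.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Least-squares estimate of alpha in P(k) ~ k**(-alpha); needs at least two types."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(k) for k in range(1, len(freqs) + 1)]   # log rank
    ys = [math.log(f) for f in freqs]                      # log frequency
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic text: the k-th word occurs about 1000 / k times (an exact Zipf shape).
tokens = [f"w{k}" for k in range(1, 51) for _ in range(round(1000 / k))]
alpha = zipf_exponent(tokens)
print(alpha)
```

On real text the low-frequency tail contains many tied ranks, so a least-squares fit over all ranks can be sensitive to the range of ranks included.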

According to Deacon (1997), “This might explain why human language is unique
with regard to other species but not only so. One-to-one maps between signals and
objects are the distinguishing feature of index reference. Symbolic communication
is a higher-level reference in which reference results basically from interactions
between signals. Zipf's law appears on the edge of the indexical communication
phase and implies polysemy. The latter is the necessary (but not sufficient)
condition for symbolic reference”.

Situngkir reported statistical observations of Zipf's law across different human
languages, using corpora that tell the same things in each language. This is
expected to reduce the possible sensitivity to the meaning of the texts, so that the
different stylized statistics are closer to what emerges from the respective structure
of each language, whether grammatical or lexical. Interestingly, it has also been
shown that Zipfian statistics are robust throughout the raw corpora analyzed.

According to Stewart (1994), Zipf's law was developed to describe the frequency
of word use in documents. He applied Zipf's law to three classic sets of word
frequency data: Eldridge's distribution of word usage in four American newspaper
articles, Brugmann's study of four plays in Plautine Latin, and noun frequency in
Macaulay's essay on Bacon. He obtained excellent fits to these data sets.

Recent Corpora 



Corpus                    Size          Domain     Language
NA News Corpus            600 million   Newswire   American English
British National Corpus   100 million   Balanced   British English
EU proceedings            20 million    Legal      10 language pairs
Penn Treebank             2 million     Newswire   American English
Broadcast News                          Spoken     7 languages
SwitchBoard               2.4 million   Spoken     American English

Table 2.1: Some examples of recent corpora


For more corpora, the Linguistic Data Consortium at http://www.ldc.upenn.edu/ can
be visited.

Gelbukh and Sidorov (2001) observed that the coefficients of Zipf's law are
different for different languages. They illustrated this through English and Russian
examples. It is important to account for this, as it may have some implications for the
nature of language. They further commented that the behaviour of Zipf's law is
different in these languages as “Russian is a highly inflective language while
English is analytical. Spanish, having “inflectivity” intermediate between Russian
and English, showed intermediate results as to the coefficients. The other aspect is
that lexical richness of Russian is greater than that of English (and Spanish)”.

Ferrer-i-Cancho and Sole (2001) showed that the co-occurrence of words in
sentences relies on the network structure of the lexicon. They analyzed the
properties in depth and commented that human language can be described in terms
of a graph of word interactions.

Turner (1997) investigated the relationship between vocabulary, text length and Zipf's
law. He tried to determine the rate at which previously unused words were added to a
text as an author increased its length. The question was whether there exists a
relationship between vocabulary and text length. He chose four texts, viz.
two Shakespearean plays, Antony and Cleopatra and Richard III, and two novels,
Wuthering Heights by Emily Bronte and Sense and Sensibility by Jane Austen. He found
that the rate at which new words are added to the text as its length increases follows
a power law. The rate is lower than that from Zipf's law for both novels and plays.
The rate came out to be approximately two thirds for plays and a half for novels. He
also commented that the deviation of actual text from the Zipf's law distribution needs
classification and explanation.

Landini (1997) applied Zipf's law to the Voynich manuscript (a mysterious
manuscript which is still undeciphered). It is still not known in which language it is
written, nor are its alphabet and abbreviations understood. However, it was observed that
rank-frequency and length-frequency regularities are still present in this manuscript. Li (1998)
showed that Zipf's law is applicable to random texts also; however, such random
texts should have a very different word and text length distribution than a natural
language. Perhaps this is the reason the law appears differently in natural languages
than in random collections of characters. Zipf searched for a principle
of least effort that would explain the equilibrium between uniformity and diversity
in usage of words. Most others searched for a probabilistic explanation. The burning
question still remains: do we have any new evidence that Zipf's explanation by the
principle of least effort is more correct than a statistical explanation?

There have been many applications of the law in natural languages, like English
(Miller et al., 1958), Chinese (Rousseau et al., 1992), the Voynich manuscript
(Landini, 1997), etc. However, there are few applications of the law to random
texts. Li (1998) showed that Zipf's law is applicable to random texts provided
they have a very different word structure and length distribution than a natural language.

To investigate this area further, Saxena et al. (2004) selected a random text and
tried to find clues on the distribution of rank and frequency. An attempt has been
made to evolve a new ranking method, based on tied ranks, and a comparison has
been made with the random rank method, deployed by Zipf (1949), and the maximum
rank method, deployed by Chen & Leimkuhler (1987). According to
Mandelbrot (1953), “The monkey language is, in the terminology of fractal
geometry, self-similar and grows on infinite trees (any branch of the tree will be
identical to the tree itself), thus needing an infinite dictionary. A natural language
like English, on the other hand, is a massively geared down system that economizes
on entropy in a number of ways, e.g., the interdependence, or redundancy, of
words that seems necessary in order to make a text “meaningful.” Most letter
combinations (an uncountable set) in English are non-words”. However, the random
text taken for analysis in this communication is called “random” only because,
though it is in English, it follows a very subject-specific usage of words, e.g. use of
hyphenated words. Hence, in this communication, the random text used differs
from monkey-typing text by only one virtue, i.e. every word in this random text has
a definite meaning.

Chao & D'haeseleer (2001) attempted to find the distribution of variable-length
phatic interjectives on the World Wide Web. They found that the number of pages
found containing these words would fall off as a power law. However, the exponents
for length-frequency distributions of different interjectives were much larger than the -1
predicted by Zipf's law. Parunak (1979) developed a data-analytic technique and
applied it to counts of words from large texts in Greek, French and English.
According to him, “Counted data, whether the number of words in a text or the
number of animals of various species in a population, often lacks the usual forms of
structure.... It is however possible to rank the counts from most frequent to least
frequent. The resulting frequency distributions are usually very long tailed and they
follow a fairly regular pattern, which is approximated by the rank-size rule”. It was
found that word frequency distributions are dependent on word length.

Sen et al. (1998) investigated the application of Zipf's law to technical writing.
They commented that “technical writing differs from literary or ordinary writing in
a number of ways. In technical writing, more often than not, each term represents a
particular concept which is used again and again whenever the author refers to that
concept thus leading to the increase in the frequency of its use”. It was found that
LIS writings also follow Zipf's law when only the textual part of the writing
is considered, omitting alpha-numeric and alpha-symbolic expressions,
abbreviations, headings of illustrations, intra-text references, words figuring within
tables and keywords.

There are other instances of Zipf's law in natural languages. Dahl (1979)
analyzed word frequencies of spoken American English (verbatim). He found that the top
twenty words were: I, and, the, to, that, you, it, of, a, know, was, uh, in, but, is, this,
me, about, just, don't. A similar result has been obtained by Ferrer-i-Cancho &
Sole (2001). They commented that the so-called particles, a subset of the function
words (e.g. articles, prepositions and conjunctions) which are used for speeding up
the navigation, form the most frequently occurring words. The top ten were found
to be “and”, “the”, “of”, “in”, “a”, “to”, “s”, “with”, “by”, and “is”.

Ridley & Gonzales (1994) applied Zipf's law to small
samples of adult speech. Balasubrahmanyan & Naranan (1998) described models
for power law relations in Linguistics and Information Science. Sen et al. (1998)
conducted a study that indicated that technical writings such as LIS writings also
follow Zipf's law when only the textual part of the writing is considered, omitting
alpha-numeric and alpha-symbolic expressions, abbreviations, headings of
illustrations, intra-textual references, words figuring within tables, and keywords.
Egghe (1999) applied this law to multi-word phrases. Prun (1999) illustrated
Zipf's conception of language as an early prototype of synergetic linguistics.

Le Quan Ha et al. (2002) found a confirmation of Zipf's law in the extended form.
They found that n-gram word phrases as well as single words follow Zipf's law
accurately. They verified this result for five languages, viz. English, Mandarin,
Irish, Latin and Vietnamese. Martynyuk (2006) applied Zipf's law to Hindi and
Urdu texts. According to Martynyuk (2006), “Statistical regularities are the basis
of the structure of the vocabulary of any language or text. Zipf's law is a reflection
of a specific property of the organization of human memory, which usually operates
with more frequent language units in all cases of the spontaneous use of speech”.

Research Methodology


Objectives 

The objectives of the present work are: 

• To find the interrelationships between the rank and the frequency of a word in 
selected literatures. 

• To test whether Zipf's law can be applied to these literatures.

• To do an inter-literature comparison of the applicability of Zipf's law.

• To do mathematical modelling & validation of the model through the collected 
data. 


Hypothesis 

The hypotheses of this work are that 

1. The principle of least effort is a universal phenomenon, 

2. All writers would follow an economy in the use of words irrespective of the language 
concerned, 

3. The rank-frequency distribution of words would be similar in all languages. 


The Data 

For inter-literature comparison of the applicability of Zipf's law, we have selected the
following 31 sets of text from diverse literatures.

o Computer Science : A text of 10,043 words from a computer science book, “Operating
System - Concepts and Design”, by Milenkovic, Second edition, 1997 (Tata
McGraw Hill, New Delhi).

o Hindi literature : The IIT Kanpur e-text of the roman version of “Eidgaah” by Munshi
Prem Chand (http://www.munshipremchand.iitk.ac.in/author.html). This website has
been built as part of a larger effort to create a series of websites based on Indian
philosophical texts, under a project in the Department of
Computer Science & Engineering at the Indian Institute of Technology Kanpur. (File:
eidgaah.txt)

o English : The Project Gutenberg e-text of “Aladdin and the Wonder Lamp”, a "public
domain" work distributed by Professor Michael S. Hart through the Project
Gutenberg Association. Project Gutenberg is the oldest producer of free e-books on
the Internet (http://www.gutenberg.org/). (File: aladdin eng.txt)

o German : The Project Gutenberg e-text of “Aladdin und die Wunderlampe”, by
Ludwig Fulda, with original illustration by Max Liebert. Project Gutenberg is the
oldest producer of free e-books on the Internet (http://www.gutenberg.org/). (File:
aladdin ger.txt)

o Library Science : The Project Gutenberg e-text of “The Library”, by Andrew Lang,
#20 in our series by Andrew Lang, December, 1999 (File: librarys.txt)

o Sanskrit : The Project Gutenberg e-book of “Sri Vishnu Sahasranaamam”, by
Unknown. It is in Sanskrit and the character set encoding is US-ASCII. This e-text was
transcribed by N. Srinivasan & Karthik Krishnan and formatted by Maitri Venkat-Ramani.
This e-text can be transliterated into Sanskrit using the ITRANS processing
tool at the following location (File: sanskritwork.txt):
http://sanskrit.gde.to/processing_tools/processing_tools.html

o A Language of India : For this portion we have taken an e-text from the English
version of the collection of Ghazal “Bisat-e-Hyder” by Hyder Zaheer Ansari Hyder
(http://www.bisatehyder.indiaaccess.com/). (File: urdu.txt)

o A book on Thesaurus & their meanings or a dictionary : The Project Gutenberg e-text
of “Mr. Honey's Small Business Dictionary (English-German)” by Winfred
Honig. Mr. Honey (Winfred Honig) compiled English/German dictionaries for
almost 3 decades to provide his colleagues and students with samples of the language



No.  Text from             Title                                       File Name                  No. of Words  Appendix
1.   Computer Science      Operating System - Concepts and Design
                           by Milan Milenkovic
2.   Hindi literature      Eidgaah by Munshi Prem Chand                eidgaah.txt                4951
3.   English               Aladdin and the Wonder Lamp                 aladdin eng.txt            5319
4.   German                Aladdin und die Wunderlampe                 aladdinger.txt             17686
5.   Library Science       The Library by Andrew Lang                  librarys.txt               37498
6.   Sanskrit              Sri Vishnu Sahasranaamam                    sanskritwork.txt           1411
7.   Urdu                  Bisat-e-Hyder by Hyder Zaheer Ansari Hyder  urdu.txt
8.   Dictionary/Thesaurus  Small Business Dictionary (English-German)  Eng-ger-busDictionary.txt
                           by Winfred Honig

Table 3.1: Description of first set of documents
of business and highlight the need for special dictionaries covering the special
language used in different branches of the industry. These wordlists are now fed into
the LEO Online Dictionary (http://dict.leo.org) and the DicData Online Dictionary
(http://www.dicdata.de) (File: Eng-ger-busDictionary.txt).

The descriptions of the other texts are as follows. Apart from these, we have taken
many texts spanning a long period of time and some popular e-texts from the Project
Gutenberg database, Infomotions, etc.

o E-texts over time: Public domain electronic texts (e-texts) in the areas of American
and English literature as well as Western philosophy are taken in this category. These
are "classic" texts that have stood the test of time. They also encompass a huge time
period, from as far back as 400BC to the present. (http://www.infomotions.com/etexts/)

o Popular e-texts: Popular e-texts like “365 Foreign Dishes”, “The Arabian Nights 
Entertainments”, “The Arctic Queen” and “The Atomic Bombings of Hiroshima and 
Nagasaki” were also taken to investigate the relationship. 

The following is a description of these texts:

No.  Text from                        Title                                               No. of Words  Appendix
1.   American Literature 1700-1799    The Autobiography of Benjamin Franklin              68157         19
2.   American Literature 1800-1899    Autobiography by Thomas Jefferson, 1743-1790        40648         21
                                      (With the Declaration of Independence)
3.   American Literature 1900-1999    Tom Sawyer, Detective by Mark Twain, from           24486         30
                                      "The Writings of Mark Twain", Volume XX
4.   English Literature 700-799       Beowulf, from The Harvard Classics, Volume 49       27129         11
5.   English Literature 1200-1299     The Canterbury Tales by Geoffrey Chaucer            99403         13
6.   English Literature 1500-1599     Romeo and Juliet by Shakespeare                     26784         29
7.   English Literature 1600-1699     The Pilgrim's Progress, by John Bunyan              57122         9
8.   English Literature 1600-1699     Hamlet by Shakespeare                               33098         28
9.   English Literature 1700-1799     The Wrongs of Woman by Mary Wollstonecraft          45874         31
10.  English Literature 1800-1899     A Christmas Carol by Charles Dickens                21818         15
11.  English Literature 1800-1899     Endymion: A Poetic Romance by John Keats            31962         22
12.  English Literature 1900-1999     Peter Pan by James M. Barrie                        47885         10
13.  Western Philosophy 400BC-301BC   Meteorology by Aristotle                            43470         6
14.  Western Philosophy 100BC-1BC     On The Nature of Things by Titus Lucretius Carus    75386         25
15.  Western Philosophy 400-499       Confessions and Enchiridion by Saint Augustine      176014        8
16.  Western Philosophy 1600-1699     Concerning Civil Government, Second Essay: An       53786         24
                                      essay concerning the true original extent and end
                                      of Civil Government, by John Locke, Chapter I
17.  Western Philosophy 1700-1799     A Treatise Concerning The Principles of Human       36342         12
                                      Knowledge by George Berkeley
18.  Western Philosophy 1800-1899     The Subjection of Women by John Stuart Mill         45240         26
19.  Western Philosophy 1900-Present  A Young Girl's Diary,                               72133         20
                                      Prefaced with a Letter by Sigmund Freud
20.  Popular                          365 Foreign Dishes, by Unknown                      27891         1
21.  Popular                          The Arabian Nights Entertainments, by Anonymous     90768         4
22.  Popular                          The Arctic Queen, by Unknown                        16703         5
23.  Popular                          The Atomic Bombings of Hiroshima and Nagasaki       25341         7
                                      by The Manhattan Engineer District

Table 3.2: Description of second set of documents

The Software


We tried a number of software packages for calculating word frequencies from text. We
searched the World Wide Web (www) for freeware or shareware that can do this
work and found four major packages: Hermetic Word Frequency Counter
5.32, Textanz Word and Phrase Frequency Counter v.1.3, Fore Words Pro 1.2.0.41 and
TextSTAT. We tried to analyze various text files with these packages. The first three
calculated the frequencies, but since we were using the demo versions, we faced a
major limitation of not being able to transfer the output to a file. We therefore switched to
TextSTAT, which is completely free software. Thus the software used in this work for
calculating word frequencies from the texts is “TextSTAT”. Shown below is a screen
shot of TextSTAT.

Figure 3.1: A screen shot of TextSTAT

TextSTAT is a simple program for the analysis of texts made by the Free University of
Berlin. It produces word frequency lists and concordances from ASCII/ANSI texts, MS
Word and HTML files. TextSTAT can be downloaded from the website
http://www.niederlandistik.fu-berlin.de/textstat/software-en.html .

All unique words were ranked in decreasing order of their frequency of occurrence.
Different ranks were assigned to each of them according to Zipf's
random-rank approach.

Ranking Method

Zipf (1949) used the random rank approach, i.e. words are arranged in decreasing order of
frequency and ranks are allotted in ascending order. In this way the word with maximum
frequency gets rank 1 and so on; words with the same frequency receive consecutive ranks
in arbitrary order. This leads to steps for large values of rank, which is
one of the disadvantages of the random rank method. Chen and Leimkuhler (1987)
overcame this problem by using the maximum rank for all the words with the same
frequency. Also, their method helped in preserving the convertibility between the
frequency-rank distribution and the frequency-count distribution, which was not possible in the
random rank approach. Another method, proposed by us, is based on the concept of 'ties':
if two observations are tied, i.e. they have the same frequency, then
they should be assigned the average of their random ranks. This
was done in order to stabilize the product of frequency and rank. This method is
demonstrated with the data for the computer science literature. However, in all other texts
the random rank approach of Zipf will be applied.
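
The three ranking conventions described above can be compared side by side with a small sketch. It is only an illustration under stated assumptions: the function name `rank_words` and the `method` labels are hypothetical and do not come from Zipf, Chen & Leimkuhler, or the software used in this work.

```python
from itertools import groupby


def rank_words(frequencies, method="random"):
    """Assign ranks to word frequencies under one of three conventions.

    frequencies: iterable of word counts (any order).
    method: "random"  - consecutive ranks 1, 2, 3, ... (Zipf's approach),
            "maximum" - every tied word gets the largest rank of its group,
            "tied"    - every tied word gets the average rank of its group.
    Returns the ranks in order of decreasing frequency.
    """
    ordered = sorted(frequencies, reverse=True)
    ranks = []
    position = 0  # number of words ranked so far
    for _, group in groupby(ordered):
        size = len(list(group))  # how many words share this frequency
        first, last = position + 1, position + size
        if method == "random":
            ranks.extend(range(first, last + 1))
        elif method == "maximum":
            ranks.extend([last] * size)
        elif method == "tied":
            ranks.extend([(first + last) / 2] * size)
        else:
            raise ValueError("unknown ranking method: " + method)
        position = last
    return ranks


if __name__ == "__main__":
    counts = [5, 3, 3, 1, 1, 1]
    for name in ("random", "maximum", "tied"):
        print(name, rank_words(counts, name))
```

With the counts 5, 3, 3, 1, 1, 1 the two words of frequency 3 occupy positions 2 and 3, so they receive ranks 2 and 3 under the random method, 3 and 3 under the maximum method, and 2.5 and 2.5 under the tied method.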

Analysis 

All unique words were ranked in decreasing order of their frequency of occurrence.
Different ranks were assigned to each of them according to Zipf's
random-rank approach. We then found the rank frequency g(r), i.e. the number of
words of the same rank. This was done in order to obtain the product r x g(r), where r is
the word rank.

Microsoft Excel has been used extensively to “sort” the data in the first place and 
“advanced filter” feature of the Excel is used to filter out the unique frequencies. 

A brief description of these features is given below: 

Sort : Sort rows in ascending order based on the contents of one column. Alphabets are 
sorted in ascending alphabetic order and numbers are sorted from lowest to highest value. 

Advanced Filter : Advanced filter criteria can include multiple conditions applied in a 
single column, multiple criteria applied to multiple columns, and conditions created as 
the result of a formula. It can filter out unique records from a column. 

Count If : Counts the number of cells within a range that meet the given criteria.
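
For readers without Excel, the sort, unique-filter and count steps listed above can be sketched in Python. This is not the TextSTAT or Excel procedure used in this work; the tokenisation rule and the function names `word_frequencies` and `frequency_of_frequencies` are assumptions introduced only for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, frequency) pairs sorted by decreasing frequency."""
    # Assumed tokenisation: lowercase runs of letters, optionally joined
    # by hyphens or apostrophes, so a hyphenated word counts as one word.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    counts = Counter(words)
    # "Sort": decreasing frequency, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def frequency_of_frequencies(pairs):
    """Count how many distinct words share each frequency value.

    This mirrors "Advanced Filter" (unique frequencies) followed by
    "Count If" (number of words having each frequency).
    """
    tally = Counter(freq for _, freq in pairs)
    return sorted(tally.items(), reverse=True)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    pairs = word_frequencies(sample)
    print(pairs)
    print(frequency_of_frequencies(pairs))
```

Breaking ties alphabetically is a design choice made here so that the output is reproducible; any other tie order would be equally valid under the random-rank approach.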
Once the Zipfian data had been obtained for the various files, we calculated the log
(base 10) values for the rank and rank frequency. Regression analysis and curve-fitting
were done on this data. A linear fit was done in order to find the applicability of the
Zipf-Mandelbrot law. Mandelbrot assumed that the aim of language is to transmit the most
information per symbol with the least effort. He proposed the following relationship:

$f = k (r + c)^{\theta}$

Where f is the frequency and r is the rank of the word; c and θ are constants. Here, c
improves the fit for small r and the exponent θ improves the fit for large r. Data follow a
Zipfian distribution if the exponent θ remains close to -1.
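
A minimal sketch of the linear fit on log-log data follows, assuming the special case c = 0 so that log f = log k + θ log r. The function name `fit_log_log` is hypothetical, and the synthetic data are generated only to show the mechanics; the analyses in this work were carried out with the statistical packages named below, not with this code.

```python
import math


def fit_log_log(ranks, freqs):
    """Ordinary least-squares line through (log10 rank, log10 frequency).

    Returns (slope, intercept). Under the relationship f = k * r**theta
    (the Zipf-Mandelbrot form with c = 0), the slope estimates theta and
    the intercept estimates log10(k).
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic, exactly Zipfian data: f = 1000 / r.
    ranks = list(range(1, 51))
    freqs = [1000.0 / r for r in ranks]
    theta, log_k = fit_log_log(ranks, freqs)
    print(round(theta, 6), round(log_k, 6))
```

A slope close to -1 on real data would indicate Zipfian behaviour in the sense described above; fitting a non-zero c requires a nonlinear method.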

We have used various statistical packages like SPSS, Minitab and Curve Expert to carry 
out these analyses on the selected texts. 


A note on Nonlinear Regression 

Nonlinear regression fits a mathematical model to data. A mathematical model is a
simple description of a state or process; a model can help in designing better
experiments and in making sense of the results. According to Levins (1966), "A
mathematical model is neither a hypothesis nor a theory. Unlike scientific hypotheses, a
model is not verifiable directly by an experiment. For all models are both true and false....
The validation of a model is not that it is "true" but that it generates good testable
hypotheses relevant to important problems”. When one fits a model to data, one obtains
best-fit values that can be interpreted in the context of the model.


Some programs automatically fit data to hundreds or thousands of equations and then
present the equation(s) that fit the data best. The goal of nonlinear regression is to
adjust the values of the parameters in the model to find the curve that best predicts Y from
X. More simply, the goal is to find the curve that comes closest to the points. To achieve
this, the regression procedure minimizes the sum of the squares of the vertical distances
of the points from the curve. For this reason, linear and nonlinear regressions are
sometimes called least squares methods. Some nonlinear regression problems can be
linearized by a suitable transformation of the model formulation. For example, consider
the following nonlinear regression problem (ignoring the error).

For the exponential family, let us take a model of the form $y = a e^{bx}$. If we take the
logarithm of both sides, it becomes $\log y = \log a + bx$. Now only estimation of the
unknown parameters by a linear regression of log(y) on x is required.
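
The linearization above can be illustrated with a short sketch that regresses ln(y) on x and then recovers a and b. The function name `fit_exponential` and the test data are assumptions made for the example; note also that least squares on the log scale weights the errors differently from a direct nonlinear fit, so the two approaches need not give identical estimates on noisy data.

```python
import math


def fit_exponential(xs, ys):
    """Estimate a and b in y = a * exp(b * x) by linear regression.

    Taking logarithms gives ln(y) = ln(a) + b * x, so b is the slope and
    ln(a) the intercept of the least-squares line of ln(y) on x.
    All y values must be positive.
    """
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [2.0 * math.exp(0.5 * x) for x in xs]  # noise-free illustration
    print(fit_exponential(xs, ys))
```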

In Curve Expert, the nonlinear models have been divided into families based on their
characteristic behavior. These families and their members are enumerated below:

Exponential Family

Exponential models have the exponential or logarithmic functions involved. They are
generally convex or concave curves, but some models in this group are able to have an
inflection point and a maximum or minimum.


Exponential:            y = a*exp(b*x)
Modified Exponential:   y = a*exp(b/x)
Logarithm:              y = a+b*ln(x)
Reciprocal Logarithm:   y = 1/(a+b*ln(x))
Vapor Pressure Model:   y = exp(a+b/x+c*ln(x))


Power Family 

The Power Family involves raising one or more parameters to the power of the
independent variable, or raising the independent variable to the power of a given parameter.
This family is generally a set of convex or concave curves with no inflection points or
maxima/minima.

Power Fit:            y = a*x^b
Modified Power:       y = a*b^x
Shifted Power:        y = a*(x-b)^c
Geometric:            y = a*x^(b*x)
Modified Geometric:   y = a*x^(b/x)
Root Fit:             y = a^(1/x)
Hoerl Model:          y = a*(b^x)*(x^c)

Yield-Density Models 

The yield-density models are widely used, especially in agricultural applications. These
models historically have been used to model the relationship between the yield of a crop
and the spacing or density of planting. Essentially two types of response are observed in
practice: the "asymptotic" and "parabolic" yield-density relations. If the response is such
that as density (x) increases, the yield (y) approaches a fixed value, the relationship is
asymptotic. If the response is such that there is a distinct optimum as the density
increases, the relationship is parabolic. Of course, these types of relationships occur
commonly in other scientific areas; therefore, this family of models is very useful.

Reciprocal Model:       y = 1 / (a + bx)
Reciprocal Quadratic:   y = 1 / (a + bx + cx^2)
Bleasdale Model:        y = (a + bx)^(-1/c)
Harris Model:           y = 1 / (a + bx^c)


Growth Family 

Growth models are characterized by a monotonic growth from some fixed value to an
asymptote. These models are most common in the engineering sciences.

Exponential Assoc (2):   y = a*(1-exp(-bx))
Exponential Assoc (3):   y = a*(b-exp(-cx))
Saturation Growth:       y = ax / (b + x)


Sigmoidal Family 

Processes producing sigmoidal or "S-shaped" growth curves are common in a wide
variety of applications such as biology, engineering, agriculture, and economics. These
curves start at a fixed point and increase their growth rate monotonically to reach an
inflection point. After this, the growth rate approaches a final value asymptotically. This
family is actually a subset of the Growth Family, but is separated because of its
distinctive behavior.

Gompertz Model:   y = a * exp(-exp(b - cx))
Logistic Model:   y = a / (1 + exp(b - cx))
Richards Model:   y = a / (1 + exp(b - cx))^(1/d)
MMF Model:        y = (ab + cx^d) / (b + x^d)
Weibull Model:    y = a - b*exp(-cx^d)

Miscellaneous Family 

As with many things in life, some things just don't fit into nice categories. The
miscellaneous family is the one in which these "different" nonlinear regression models
are placed.

Sinusoidal Fit:        y = a + b*cos(c*x + d)
Gaussian Model:        y = a*exp((-(x - b)^2)/(2*c^2))
Hyperbolic Fit:        y = a + b/x
Heat-Capacity Model:   y = a + bx + c/x^2
Rational Function:     y = (a + bx) / (1 + cx + dx^2)


According to Hyams, “Given a set of data points, often called "observations," a common
need is to condense the data by fitting it to a model in the form of a parametric equation.
This "model equation" can be anything that the user desires - it can range from a simple
polynomial to an extremely complex model with many parameters". One should try to
uncover the underlying law that the data suggest and then select an appropriate model.
Regression is one of several techniques of data modeling. Regression ensures that the "merit
function", which measures the disagreement between the data and the model, is
minimized with respect to the adjustment of the model parameters.



If one takes a linear model of the form $y = a_1 X_1(x) + a_2 X_2(x) + \dots + a_n X_n(x)$,
where the $X_i(x)$ may be non-linear functions of x but the model is linear in the parameters $a_i$,
linear regression can be used to minimize the difference between the model and data.
The merit function in this case would be

$\chi^2 = \sum_{i=1}^{N} \left[ \frac{y_i - \sum_{k=1}^{n} a_k X_k(x_i)}{\sigma_i} \right]^2$

This has to be minimized; the parameters $a_i$ are obtained in this way.

In the case of non-linear regressions, the following method is used as per the
documentation of the program Curve Expert. The program uses the Levenberg-Marquardt
method to solve nonlinear regressions. This method combines the steepest-descent
method and a Taylor series based method to obtain a fast, reliable technique for nonlinear
optimization. Neither of the above optimization methods is ideal all of the time; the
steepest descent method works best far away from the minimum and the Taylor series
method works best close to the minimum. The Levenberg-Marquardt (LM) algorithm
allows for a smooth transition between these two methods as the iteration proceeds.


In general, the data modeling equation (with one independent variable) can be written as
follows:

$y = y(x; a)$

The above expression simply states that the dependent variable 'y' can be expressed as a
function of the independent variable 'x' and a vector of parameters 'a' of arbitrary length.
Note that using the LM method, any nonlinear equation with an arbitrary number of
parameters can be used as the data modeling equation. Then, the “merit function” we are
trying to minimize is

$\chi^2(a) = \sum_{i=1}^{N} \left[ \frac{y_i - y(x_i; a)}{\sigma_i} \right]^2$

Where N is the number of data points, $x_i$ denotes the x data points, $y_i$ denotes the y data
points, $\sigma_i$ is the standard deviation (uncertainty) at point i, and $y(x_i; a)$ is an arbitrary
(nonlinear) model evaluated at the i-th data point. This merit function simply measures the
agreement between the data points and the parametric model; a smaller value for the
merit function denotes better agreement. Commonly, this merit function is called the
chi-square.
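
To make the chi-square merit function concrete, the sketch below evaluates it for a power-fit model y = a*x^b. It only computes the quantity that a method such as Levenberg-Marquardt would minimize; it is not Curve Expert's implementation, and the names `chi_square` and `power_model` are hypothetical.

```python
def chi_square(xs, ys, sigmas, model, params):
    """Sum of squared, uncertainty-weighted residuals.

    model(x, params) returns the predicted y for one data point;
    sigmas holds the standard deviation (uncertainty) of each point.
    """
    return sum(
        ((y - model(x, params)) / s) ** 2
        for x, y, s in zip(xs, ys, sigmas)
    )


def power_model(x, params):
    """Power fit y = a * x**b, with params = (a, b)."""
    a, b = params
    return a * x ** b


if __name__ == "__main__":
    xs = [1.0, 2.0, 4.0]
    ys = [3.0, 1.0, 0.5]       # the first point is off the curve by 1
    sigmas = [1.0, 1.0, 1.0]   # equal uncertainties assumed
    print(chi_square(xs, ys, sigmas, power_model, (2.0, -1.0)))
```

A perfect fit gives a value of zero; larger values indicate greater disagreement between the data and the model for the given parameters.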


A note on Project Gutenberg e-text & e-books 

Project Gutenberg is the first and largest single collection of free electronic books or
e-books. Michael Hart, founder of Project Gutenberg, invented e-books and
continues to inspire the creation of e-books and related technologies today. Project
Gutenberg began in 1971 when Michael Hart was given an operator's account with
$100,000,000 of computer time in it by the operators of the Xerox Sigma V mainframe at
the Materials Research Lab at the University of Illinois.

Hart announced that the greatest value created by computers would not be computing, but
would be the storage, retrieval, and searching of what was stored in our libraries. The
Project's eventual goal is to provide Public Domain e-text editions a short time after they
enter the Public Domain. Of course, the period before a copyrighted work enters the
Public Domain was extended from 28 years (with a 28 year extension available) to 50
years more than the life of the author.

The Project Gutenberg Philosophy is to make information, books and other materials
available to the general public in forms a vast majority of the computers, programs and
people can easily read, use, quote, and search. There are three portions of the Project
Gutenberg Library, which can basically be described as:

Light Literature; such as Alice in Wonderland, Through the Looking-Glass, Peter
Pan, Aesop's Fables, etc.

Heavy Literature; such as the Bible or other religious documents, Shakespeare,
Moby Dick, Paradise Lost, etc.

References; such as Roget's Thesaurus, almanacs, and a set of encyclopedia,
dictionaries, etc.

A note on the IIT Kanpur's e-text

During 1990-92, Professor R.M.K. Sinha conceptualized the design of a Machine Aided
Translation system for translation from English to Indian languages. This system was
named ANGLABHARTI and the underlying methodology the ANGLABHARTI
Technology or ANGLABHARTI Approach.

ANGLABHARTI represents a machine-aided translation methodology specifically
designed for translating English to Indian languages. Indian languages have a relatively
free word order. Instead of designing translators for English to each Indian language,
Anglabharti uses a pseudo-interlingua approach. It analyses English only once and
creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). This
is the basic translation process, translating the English source language to PLIL with most
of the disambiguation having been performed. The PLIL structure is then converted to
each Indian language through a process of text-generation.

During 1995-97, the Department of Electronics, Govt. of India, sanctioned a grant-in-aid for
implementation of the project titled "Machine Aided Translation from English to Hindi
for standard documents (domain of Public Health Campaign) based on ANGLABHARTI
approach". In 1995-96, IITK also designed and developed an Example-based approach
for Machine Aided Translation for similar (Indian languages) and dissimilar (English and
Indian Languages) under the leadership of Professor R.M.K. Sinha. This approach has
been named the ANUBHARTI approach.

Currently, AnglaHindi, the English to Hindi MAT based on the Anglabharti methodology,
which accepts unconstrained text, has already been made available to the users and is
very well received. AnglaUrdu, which is based on AnglaHindi, has also been
demonstrated. HindiAngla, the Hindi to English MAT based on the Anubharti methodology,
has been demonstrated for simple sentences and further work is going on to handle
compound and complex sentences.
Analysis

We have analysed the different documents collected individually. We present the findings
obtained by the analysis in the following sections.

Section 1: Zipf's Law in Computer Science Literature

First Analysis: Whether Zipf's law is applicable to random text in the English language
from Computer Science literature.

Zipf's law postulates that the frequency of occurrence of any word as a function of rank
follows a power law with exponent close to unity. It has been applied to many areas like
natural languages, monkey-typing texts, web-access statistics, informetrics, finance and
business and ecological systems, etc. Opinions differ on whether the power law embedded
in Zipf's law is actually a Yule distribution (Martindale et al., 1996), a lognormal
distribution (Perline, 1996) or a stretched exponential distribution (Laherrere et al.,
1998). There have been many applications of the law in natural languages, like English
(Miller et al., 1958), Chinese (Rousseau et al., 1992), the Voynich manuscript (Landini,
1997), etc. However, there are few applications of the law to random texts. Li (1998)
showed that Zipf's law is applicable to random texts provided the text has a very
different word structure and length distribution than a natural language.

To investigate this area further, we have selected a random text from Computer
Science literature and have tried to find clues on the distribution of rank and frequency.
An attempt has been made to evolve a new ranking method, based on tied ranks, and a
comparison has been made with the random rank method, deployed by Zipf (1949), and the
maximum rank method, deployed by Chen & Leimkuhler (1987). According to
Mandelbrot (1953), “The monkey language is, in the terminology of fractal geometry,
self-similar and grows on infinite trees (any branch of the tree will be identical to the tree
itself), thus needing an infinite dictionary. A natural language like English, on the other
hand, is a massively geared down system that economizes on entropy in a number of
ways, e.g., the interdependence (or redundancy) of words that seems necessary in order
to make a text “meaningful.” Most letter combinations (an uncountable set) in English
are non-words”. However, the random text taken for analysis in this communication is
called “random” only because, though it is in English, it follows a very subject-specific
usage of words, e.g. use of hyphenated words. Hence, in this communication, the random
text used differs from monkey-typing text in only one respect, i.e. every word in this
random text has a definite meaning.


Methodology 

To study the application of Zipf's law and the performance of the new ranking method on
random texts, the authors have taken a text from a computer science book, "Operating
System - Concepts and Design", by Milan Milenkovic, Second edition, 1997 (Tata McGraw
Hill, New Delhi).


Word Example               Length   Frequency
[illegible]                   1        205
AN                            2       1765
CAD                           3       1580
AREA                          4     [illegible]
LOGIN                         5        730
DESIGN                        6        856
ADDRESS                       7       1076
LANGUAGE                      8        844
INTERVALS                     9        775
CONCURRENT                   10        423
UTILIZATION                  11        285
ABSTRACTIONS                 12        165
COMMUNICATION                13     [illegible]
USER-SPECIFIED               14         37
[illegible]                  15         40
REMOTE-PROCEDURE             16         54
MEMORY-MANAGEMENT            17          7
PROGRAMMER-DEFINED           18          5
ADDRESS-TRANSLATION          19          4
LOWER-PRIORITY-BASED         20          3
COMPUTATION-INTENSIVE        21          2
TRANSACTION-PROCESSING       22          1
APPLICATION-PROGRAMMING      23          1

Table 4.1: Words according to length & frequency in Computer Sc Literature
The authors have counted the frequency of occurrence of each unique word in the text,
and found 1775 unique or different words out of a total of 10,043 words in the full text.
It was observed that words of less than 9 characters in length were extensively used.
However, one striking characteristic of computer science literature was the use of
hyphenated words, which makes the word length vary over a large range. One can easily
see from Table 4.1 that after words having 13 characters, there is a series of
hyphenated words.
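The counting step described above can be sketched as follows. This is only an illustrative sketch, not the procedure actually used by the authors: the tokenisation rule (lower-casing, and treating a run of letters joined by hyphens as one word) and the function names `tokenize` and `length_distribution` are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case words, keeping hyphenated compounds
    such as 'memory-management' as single tokens (assumed rule)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def length_distribution(tokens):
    """Map each word length to the number of word occurrences of that length."""
    dist = Counter(len(word) for word in tokens)
    return dict(sorted(dist.items()))


# Hypothetical sample sentence, used only to exercise the functions.
sample = "The memory-management unit performs address-translation; the unit is fast."
tokens = tokenize(sample)
word_freq = Counter(tokens)          # frequency of each unique word

print(len(tokens), "words,", len(word_freq), "unique")
print(length_distribution(tokens))
```

Applied to the full text, `len(tokens)` and `len(word_freq)` would correspond to the total and unique word counts reported above, provided the same tokenisation rule is used.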

Use of hyphenated words can be taken as a special characteristic of the text taken, i.e. the 
computer science literature. It would thus be interesting to investigate the rank and 
frequency relationship as propounded by Zipf and other scientists in such a text. The 
authors have intentionally kept the hyphenated words as they are. One can also see that 
hyphenated words are typical in describing the very specific nature of the meaning they 
convey in the concerned literature. Some of them are the commands given to the 
computer to perform specific tasks. 

All unique words were ranked according to their frequency of occurrence in
decreasing order. Words which shared the same frequency were arranged alphabetically
and different ranks were assigned to each of them according to Zipf's approach of
random ranks. Thus, the word "able" got the rank (r) 868 and the word "writes" got the
rank (r) 1775. One can see that two words contributing 1 occurrence each are assigned
random ranks 868 and 1775, respectively, according to Zipf's random rank approach. This
leads to steps for large values of rank. This is one of the disadvantages of the random
rank method. Chen and Leimkuhler (1987) had overcome this problem by using the
maximum rank for all the words with the same frequency. Their method also helped in
preserving the convertibility between the frequency-rank distribution and the
frequency-count distribution, which was not possible in the random rank approach.

Another method, proposed by us, is based on the concept of "ties", which means that if
two observations are tied, i.e. they have the same frequency, then they should be assigned
a rank equal to the average of their random ranks. This was done in order to
stabilize the product r x g(r), especially in the last rank-range. Here, r is the word rank
and g(r) is the rank frequency, i.e. the frequency of occurrence of the word of rank r.
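The three ranking schemes compared here (random rank, maximal rank and tied rank) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `rank_words` and the sample frequencies are hypothetical, and ties are ordered alphabetically as described above.

```python
def rank_words(freq):
    """Return {word: (random_rank, maximal_rank, tied_rank)}.

    freq maps each unique word to its frequency of occurrence.
    - random rank: position after sorting by decreasing frequency,
      ties broken alphabetically (Zipf's approach)
    - maximal rank: largest random rank among words of equal frequency
      (Chen & Leimkuhler's approach)
    - tied rank: average random rank among words of equal frequency
    """
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    random_rank = {w: i for i, w in enumerate(ordered, start=1)}

    # Collect the random ranks occupied by each frequency value.
    ranks_by_freq = {}
    for w in ordered:
        ranks_by_freq.setdefault(freq[w], []).append(random_rank[w])

    result = {}
    for w in ordered:
        ranks = ranks_by_freq[freq[w]]
        result[w] = (random_rank[w], max(ranks), sum(ranks) / len(ranks))
    return result


# Hypothetical frequencies, for illustration only.
example = {"the": 5, "of": 3, "to": 3, "able": 1, "unit": 1, "writes": 1}
for word, ranks in rank_words(example).items():
    print(word, ranks)

# Words of frequency 1 occupying random ranks 868..1775 share one tied rank:
tied_rank_last_group = sum(range(868, 1776)) / len(range(868, 1776))
print(tied_rank_last_group)
```

The last line reproduces the tied rank of the final group in this text: the average of the random ranks 868 to 1775 is 1321.5, whereas the maximal rank of the same group is 1775.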
Analysis and Results

The authors had expected that the new ranking procedure based on “ties” would be able
to minimize the dispersion of the product r x g(r) in all the rank ranges, due to the simple
logic that the maximum rank would always be greater than the average rank. A
preliminary analysis of the product r x g(r) is as follows:


Rank         r x g(r) by Maximal Rank Method     r x g(r) by Tied Rank Method
range        Max      Min      Std. Dev          Max      Min       Std. Dev
1-10         1240     553      227.45            1377     553       227.4
11-51        1485     1239     57.79             1501     1239      62.85
52-99        1548     1352     56.23             1503     1352      46.15
108-228      1596     1512     30.99             1503     1456      16.29
276-1775     1775     1656     40.47             1538     1321.5    83.79

Table 4.2: Rank frequency relationships in different rank methods

It can be seen from the above table that r x g(r) is distributed with fairly low
variability except in the rank range (1-10). This is due to the fact that the observation
with rank 1 is a clear outlier. If we delete that observation from our calculation of
standard deviation, then the variability reduces substantially and comes down to 104.61
instead of 227.45. Another interesting observation is that the tied rank method shows the
same variability in the rank range (1-51), performs better in the rank range (52-228) and
performs badly in the rank range (276-1775) when compared to the maximal rank
method.


Statistical Measure              Ranking Procedure
                                 Zipf         Chen         Tied
Std. Dev                         223.76       99.14        86.47
Mean                             1393.93      1718.16      1393.93
% C.V.                           16.052       5.77         6.20
Min rank                         1            1            1
Max rank                         1775         1775         1321.50
For linear fit y = a+bx
  Parameter a                    3.05         2.99         3.03
  Parameter b                    -0.96        -0.91        -0.93
  Standard Error                 0.057        0.039        0.045
  Correlation Coefficient        0.995        0.997        0.997

Table 4.3: Comparison of different ranking models
Here Standard Error (S) is the standard error of the estimate, which quantifies the spread
of data points around the regression curve, and Correlation Coefficient (r) is the square
root of the normalized difference between the spread around the mean and the spread
around the fitting function. As the regression model better describes the data, the
correlation coefficient will approach unity. It can be seen that the random texts taken
from the computer science literature do exhibit a Zipf-like distribution, with the slope
of the linear regression touching unity. However, there is a marked difference in the
performance of Maximal Rank and Tied Rank versus the Random Rank of Zipf. There is a need
to see whether the alternative ranking procedures perform better in other texts.
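The linear fit y = a + bx on log rank and log frequency, together with S and r as described above, can be sketched with ordinary least squares. This is a sketch under assumptions, not the software actually used for the fits: the function name `loglog_fit` is hypothetical, logarithms are taken to base 10, and S is assumed to use n - 2 degrees of freedom (two fitted parameters).

```python
import math


def loglog_fit(ranks, freqs):
    """Least-squares fit of log10(freq) = a + b * log10(rank).

    Returns (a, b, S, r), where S is the standard error of the estimate
    and r the correlation coefficient, computed as described in the text.
    """
    xs = [math.log10(rank) for rank in ranks]
    ys = [math.log10(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    # Spread around the fitted line and around the mean.
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)

    s = math.sqrt(ss_res / (n - 2))              # assumes n > 2
    r = math.sqrt((ss_tot - ss_res) / ss_tot)
    return a, b, s, r


# Idealised Zipf data (frequency = 1000 / rank), for illustration only.
ranks = [1, 2, 3, 4, 5]
freqs = [1000 / rank for rank in ranks]
a, b, s, r = loglog_fit(ranks, freqs)
print(round(a, 3), round(b, 3), round(r, 3))
```

For data that follow Zipf's law exactly, the slope b equals -1 and r equals 1; the fitted slopes in Table 4.3 (-0.96, -0.91, -0.93) can be read against that reference.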

As far as the distribution of rank and frequency is concerned, it is found that the relation
is a Shifted Power distribution (Mandelbrot-Zipf law) of the form

$g(r) = a(r + b)^{c}$

where the coefficients are estimated as a = 3301.44, b = -2.99 and c = -1.23.

Where S and r are as defined above. It can be seen that the power distribution
(Mandelbrot-Zipf law) fits this type of data fairly well, but with a slight
modification in the form and parameters for different texts. Besides this, the authors
Figure 4.2: Plot of log rank with log freq. for random rank method
for Computer Science Literature

Figure 4.3: Plot of log rank with log freq. for maximal rank
method for Computer Science Literature

S = 0.05530515
r = 0.99521606

Figure 4.4: Plot of log rank with log freq. for tied rank method for Computer
Science Literature

It could be seen very clearly that both the Maximal rank method and the Tied rank method
perform better than the Random rank method of Zipf. It can be seen [...]
rank-range at the end.


Discussion 


From the figures given in the earlier section it is evident that the lower tail (containing
lower ranks) of the plot of log rank vs. log frequency behaved in the best possible manner
in the case of Maximal rank. The scatter in the tied rank method was better than that in
the random rank method but not better than that in the maximal rank method. The question
that naturally arises is whether the ranking method had a bearing on the type of text in
question.


Conclusion 
Second Analysis: Whether the distribution of words according to their length and the
hits they are able to generate on the popular search engine "Google" follows Zipf’s Law.

Introduction 

Any text is made up of words of variable length. The distribution of words contained in a
text is itself of great interest to scientists. The Web is probably the largest mass of
words of various kinds. Many scientists have attempted to examine the informetric
properties of the web in the past. Rousseau (2001) tried to analyze a time series of the
number of hits of the word “Euro” on the web during a period of one year. Breslau et al.
(1999) raised the issue of whether web requests from a fixed user community are
distributed according to Zipf's law. They found that the page request distribution seen by
web proxy caches, using traces from a variety of sources, does not follow Zipf's
distribution precisely, but instead follows a Zipf-like distribution with varying
exponents. Chao & D’haeseleer (2001) attempted to find the distribution of variable
length phatic interjectives on the World Wide Web. They found that the number of pages
found containing these words would fall off as a power law. However, the exponents for
length frequency distributions of different interjectives were much larger than the -1
predicted by Zipf's law. In this paper, we have tried to examine the distribution of words
according to their length and the hits they are able to generate on the popular search
engine “Google”.

Method and Analysis 

We have taken the hits at a particular point of time just to take a rough estimate of the
distribution of these words on the Internet. “Google” offers a scale-free network a priori
as it crawls the web from its current database. However, it would be good to try a similar
search on different search engines. We were guided by Bar-Ilan (2001), as cited in Rousseau
(2001), that most cybermetric research results more in statements of principle than in
exact results. The constraints in getting data that stood unchanged for a longer period
were accepted, and an empirical approach was taken to explore the distribution.
The authors have taken a text from a computer science book, "Operating System - Concepts
and Design", by Milan Milenkovic, Second edition, 1997 (Tata McGraw Hill, New
Delhi). The authors have counted the frequency of occurrence of each unique word in the
text, and found 1775 unique or different words out of a total of 10,043 words in the full
text. We searched for the number of hits each unique word obtained on the search engine
“Google”.

We tried to find out whether there is any relation between the word length (i.e. the
number of letters in the word) and the number of hits it gets. The distribution of words
has the following descriptive properties:


Columns: Word-length, Average log hits, Log length, Minimum, Maximum
Mean: 8.87, 8.61 [remaining rows and values not recovered]

Table: Descriptive Statistics of the words used in Computer Sc Literature

The distribution of words in the given text follows a distribution similar to the Hoerl
Model of the form $y = a b^{x} x^{c}$. It can be seen that barring the words with lengths
> 19, all word-lengths have made appearances more than once. In fact, if we do not
consider the words with length more than 16, we may treat this distribution as a Gaussian
distribution. This might not reflect anything at this stage, but when we plot the log
length against the log average hits this would have a huge impact on the inferences drawn.


Figure 4.5: Distribution of words w.r.t. length vs. frequency

To see whether there is a law embedded in the distribution we plotted the data obtained in
the following three manners.

S = 0.74389738
r = 0.87171638

Figure 4.6: Plot of Length (all) vs. average log hits
S = 0.41232787
r = 0.95031099

[Figure caption not recovered]

Figure 4.8: Plot of Length (Up to 18) vs. average log hits

The parameters of the 4th Degree Polynomial Fit were a = 9.79, b = -0.91, c = 0.11,
d = -0.006 and e = 0.0001.

Conclusion 

This result is very similar to the one found by Chao & D’haeseleer (2001) for the
distribution of variable length phatic interjectives on the World Wide Web. It is a
Zipf-type distribution with an exponent not close to unity (in fact it came out to be
-3.51). However, the interesting thing here is that there does exist some relation between
the length of the word and the number of hits it gets on the web search engine “Google”,
provided the word


Section 2: Zipf's Law in English Literature (Aladdin and the Wonder Lamp)

For this section, we have selected the Project Gutenberg e-text of “Aladdin and the
Wonder Lamp”, a "public domain" work, distributed by Professor Michael S. Hart
through the Project Gutenberg Association. Project Gutenberg is the oldest producer of
free ebooks on the Internet (http://www.gutenberg.org/).

The choice of this text was done mainly due to the following reasons: 

1. It is a popular work and it is written in a very simple manner, so the words used
are easy to understand and commonly used.

2. A German translation of this text was available, so that a comparison can be made
between the English and German languages after the analysis of the text.


The following statistics were obtained for the text under consideration. 


Statistic for the document
Number of words                               5319
Number of sentences                           661
Number of words per sentence                  8.05
Number of syllables per word (approximate)    1.54


The Flesch index of readability for this document is 68%. This means it is a fairly easy
document to understand and is of the level of eighth standard. The prominent words,
which were expected to get more occurrences in this text, were genie (24), lamp (24),
mother (25), palace (30), magician (32), sultan (43), princess (45) and Aladdin (98).
There were a lot of occurrences of the supporting words like a, of, he, and, to & the.

The zipfian data for the text is obtained and is presented in the table given on the next
page.


For a given document, the Flesch readability index is an integer indicating how difficult
the document is to understand, with lower numbers indicating greater difficulty.

$\text{Flesch Index} = 206.835 - 84.6 \cdot \frac{\text{syllables}}{\text{words}} - 1.015 \cdot \frac{\text{words}}{\text{sentences}}$

Table 4.6: Zipfian data for the text (Aladdin and the Wonder Lamp) [table body not recovered]

Word        Occurrence   Rank
a               96         7
aladdin         98         6
of             109         5
he             122         4
to             165         3
and            222         2
the            385         1

Table 4.7: [caption illegible] (Aladdin and the Wonder Lamp)

Linguists are puzzled by the phenomenon that most words are not used much while some
occur many times. Zipf explained this and called it the “principle of least effort”. He
claimed that people minimized their efforts in using language. Zipf's law thus became a
feature of human language.




With this framework and preliminary analysis, we proceeded for the regression and curve
fitting for this text.


Figure 4.9: Plot of rank & frequency in Aladdin and the Wonder Lamp

S = 0.08961042
r = 0.98537683

Figure 4.10: Plot of log rank & log frequency in Aladdin and the Wonder Lamp



The Linear Fit y = a+bx for the log rank and log frequency data obtained coefficients as

a = 0 76 and b =-0.92. The residual' plot, which shows the difference between the data 
points and the model, evaluated at the data points is shown below. 


Figure 4.11: Residual Plot for data points & model in Aladdin and the Wonder Lamp

The result verified that Zipf's law is applicable in this text and for the Mandelbrot-Zipf
law ($g(r) = a(r + b)^{c}$) the coefficient c in this case is -0.92.

The residual at point k is defined by $\text{Residual}_k = y_k - f(x_k)$,
where $y_k$ is the measured value at $x_k$, and $f(x_k)$ is the predicted value at $x_k$.




Section 3: Zipf's Law in German Literature (Aladdin und die Wunderlampe)


For this section, we have selected the Project Gutenberg e-text of “Aladdin und die
Wunderlampe”, by Ludwig Fulda with original illustrations by Max Liebert. Project
Gutenberg is the oldest producer of free e-books on the Internet
(http://www.gutenberg.org/).

This text was chosen mainly because it is a popular work. An English
translation of this text was available, so that a comparison can be made between the
English and German languages after the analysis of the text. However, this version is a
more elaborate one, as it also contains illustrations. This can be verified by the fact
that it contains almost three times as many words as the English version.

The following statistics were obtained for the text under consideration.


Statistic for the document
Number of words                               [illegible]
Number of sentences                           3536
Number of words per sentence                  5.00
Number of syllables per word (approximate)    1.70
Flesch index                                  57.90

The Flesch index of readability for this document is around 58%. This means it is a fairly
easy document to understand and is of the level of high school standard. The prominent
words, which were expected to get more occurrences in this text, were “und” (and), “die”
(the), “er” (he), “zu” (to), “in” (in), “sich” (itself) and “von” (from). So it is reaffirmed
that there were a lot of occurrences of the supporting words like a, of, he, and, to & the.
In the English version we found the most frequently occurring words, which are context
specific and are typical for the subject of the text. We tried to make a comparison of
these words in the English and German versions. The table given below gives an account of
the


For a given document, the Flesch readability index is an integer indicating how difficult
the document is to understand, with lower numbers indicating greater difficulty.

$\text{Flesch Index} = 206.835 - 84.6 \cdot \frac{\text{syllables}}{\text{words}} - 1.015 \cdot \frac{\text{words}}{\text{sentences}}$









English word (frequency)     German word
[not recovered]              Mutter
[not recovered]              Palast
magician (32)                Zauberer
sultan (43)                  Sultan

* These words may be used in a different form in the German text.

Table 4.8: English Words vs German Words & their frequency in Aladdin und die Wunderlampe

There were 4929 different words found in this analysis. The zipfian data for the text is
obtained and is presented in the table given on the next page.

Rank    Freq    Log Rank    Log Freq
   1     561      0.00        2.75
   2     426      0.30        2.63
   3     371      0.48        2.57
   4     365      0.60        2.56
[rows not recovered]
  25      95      1.40        1.98
  27      92      1.43        1.96
   -      91        -         1.96
  29      90      1.46        1.95
  30      88      1.48        1.94
[rows not recovered]
   -      41      1.79        1.61
   -      40      1.80        1.60
   -      37      1.81        1.57
   -      36      1.82        1.56
   -      35      1.83        1.54
   -      34      1.85        1.53
   -      33      1.85        1.52
   -      32      1.86        1.51
  74      31      1.87        1.49
  77      30      1.89        1.48
  81      29      1.91        1.46
  83      28      1.92        1.45
  85      27      1.93        1.43
  86      26      1.93        1.41
  90      25      1.95        1.40
  94      24      1.97        1.38
 100      23      2.00        1.36
 103      22      2.01        1.34
 104      21      2.02        1.32
 108      20      2.03        1.30
 110      19      2.04        1.28
 117      18      2.07        1.26
 118      17      2.07        1.23
 131      16      2.12        1.20
 137      15      2.14        1.18
 144      14      2.16        1.15
 151      13      2.18        1.11
 164      12      2.21        1.08
 176      11      2.25        1.04
 192      10      2.28        1.00
 218       9      2.34        0.95
 238       8      2.38        0.90
 275       7      2.44        0.85
 319       6      2.50        0.78
 361       5      2.56        0.70
 453       4      2.66        0.60
 602       3      2.78        0.48
 911       2      2.96        0.30
1633       1      3.21        0.00

Linguists are puzzled by the phenomenon that most words are not used much while some
occur many times. Zipf explained this and called it the “principle of least effort”. He
claimed that people minimized their efforts in using language. Zipf's law thus became a
feature of human language. With this framework and preliminary analysis, we proceeded for
the regression and curve fitting for this text.
The Bleasdale Model $y = (a + bx)^{-1/c}$ with coefficients a = 0.06, b = 0.002 and c = 0.47
fitted this data in the manner shown in the graph given above. The Bleasdale model
is a yield-density type model. The prominent characteristic of this model is that if the
response is such that as density (x) increases the yield (y) approaches a fixed value,
the relationship is asymptotic.
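To make the shape of the fitted curve concrete, the model can be evaluated at a few ranks using the coefficients quoted above. This is only a sketch: it assumes the usual yield-density form y = (a + bx)^(-1/c), and it assumes that x stands for rank and y for frequency; the function name `bleasdale` is a hypothetical helper.

```python
def bleasdale(x, a, b, c):
    """Bleasdale yield-density model, assumed form y = (a + b*x) ** (-1/c)."""
    return (a + b * x) ** (-1.0 / c)


# Coefficients reported for the German text (a = 0.06, b = 0.002, c = 0.47).
A, B, C = 0.06, 0.002, 0.47

for rank in (1, 10, 100, 1000):
    # Predicted value decreases steadily as the rank grows.
    print(rank, round(bleasdale(rank, A, B, C), 2))
```

With positive a, b and c the predicted value falls monotonically as x increases, which is the qualitative behaviour expected of a rank-frequency curve.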


S = 0.09803209

Figure 4.13: Plot of log ranks vs. log frequency in Aladdin und die Wunderlampe


The Linear Fit y = a+bx for the log rank and log frequency data obtained coefficients as
a = 3.20 and b = -0.92. The residual plot, which shows the difference between the data
points and the model, evaluated at the data points, is shown below.








Section 4: Zipf's Law for English-German Business Dictionary (Mr. Honey's Small
Business Dictionary (English-German))


For this section, the Project Gutenberg e-text of “Mr. Honey's Small Business Dictionary
(English-German)” by Winfried Honig was taken up. Mr. Honey (Winfried Honig)
compiled English/German dictionaries for almost 3 decades to provide his colleagues and
students with samples of the language of business and to highlight the need for special
dictionaries covering the special language used in different branches of the industry.
These wordlists are now fed into the LEO Online Dictionary (http://dict.leo.org) and the
DicData Online Dictionary (http://www.dicdata.de).

The choice of this text was done mainly due to the following reasons: 

• Dictionary does not follow any linguistic style of writing. They only depict the 
words or group of words whose meaning are to be given. 

• The Small Business dictionary was taken as it was specifically devoted to business
words.

• Translation from German to English was given. The benefit derived from this lies 
in the opportunity we got in investigating the number of words required in the 
other language to explain the original word. 

We have separated the English words from the German words and made two text files. 
The following statistics were obtained for the English part under consideration. 


Statistic for the document
Number of words                               10089
Number of sentences                           5763
Number of words per sentence                  1.75
Number of syllables per word (approximate)    2.54
Flesch index                                  -9.95


The Flesch index of readability for this document is around -10%. This means it is a
fairly difficult document to understand and is of the level of a law school graduate. The


For a given document, the Flesch readability index is an integer indicating how difficult
the document is to understand, with lower numbers indicating greater difficulty.

$\text{Flesch Index} = 206.835 - 84.6 \cdot \frac{\text{syllables}}{\text{words}} - 1.015 \cdot \frac{\text{words}}{\text{sentences}}$
prominent words, which were expected to get more occurrences in this text, were particles
like "of" [...] and the typical words for this type of text like
“price”, “goods”, “account”, “tax”, “capital”, “market”, “trade” and “costs”. These
words also appear a greater number of times because the dictionary contained
combinations of these words: “abandon a business”, “abandon a plan” and “abandon a
project” were three entries in the dictionary, so “abandon” has come three times and
“plan”, “business” and “project” each got one more count. The same would be true for the
German part.

There were 372[...] different words found in this analysis. The zipfian data for the text is
obtained and is presented in the table given on the next page.


[Table of zipfian data for the English part (columns: Rank, Freq, Log Rank, Log Freq); body not recovered]






Zipf's “principle of least effort” is not a valid connotation here, as a dictionary
mentions all the words that are required for the small business. It will thus be important
to see whether Zipf's law is applicable in this context. The hypothesis that people
minimized their efforts in using language is also not applicable here. With this framework
and preliminary analysis, we proceeded for the regression and curve fitting for this text.

The Linear Fit y = a+bx for the log rank and log frequency data obtained coefficients as
a = 2.35 and b = -0.66. The residual plot, which shows the difference between the data
points and the model, evaluated at the data points, is shown below.
The result verified that Zipf's law is not applicable in this text and for the Mandelbrot-Zipf
law ($g(r) = a(r + b)^{c}$) the coefficient c in this case is -0.66, which is not close to -1.

The following statistics were obtained for the German part under consideration.


Statistic for the document
Number of words                               9107
Number of sentences                           5792
Number of words per sentence                  1.57
Number of syllables per word (approximate)    3.41
Flesch index                                  -82.83


The Flesch index of readability for this document is around -82%. This means it is an
extremely difficult document to understand and is of the level of a law school graduate.
The prominent words, which were expected to get more occurrences in this text, were
particles like "of", "a", "and", "in" and "for", which are shown in the following table.
The typical words for this type of text are also given in the third column of the following

German (particles)    English
der                   the
in                    in
auf                   -
[illegible]           -
des                   -
von                   -
nicht                 -
eines                 -
sich                  -
und                   -

German (typical words)    English
einer                     one
ein                       one
preis                     price
ware                      commodity
einen                     one
kosten                    cost
gesetz                    law
markt                     market
nachfrage                 inquire
angebot                   offer

Frequency (one surviving column; its word group is not recoverable):
24, 22, 21, 20, 18, 18, 16, 16, 16, 15

Table 4.11: German Words, their meaning & frequency in Mr. Honey's Small Business
Dictionary

For a given document, the Flesch readability index is an integer indicating how difficult
the document is to understand, with lower numbers indicating greater difficulty.

$\text{Flesch Index} = 206.835 - 84.6 \cdot \frac{\text{syllables}}{\text{words}} - 1.015 \cdot \frac{\text{words}}{\text{sentences}}$








[...] different form is used at different places. This might be a special property of the
German language. There were 5913 different words found in this analysis. The zipfian data
for the text is obtained and is presented in the table given on the next page.

Rank    Freq    Log Rank    Log Freq
   -       -        -         1.97
   -       -        -         1.63
   -       -        -         1.62
   -       -        -         1.61
   -       -        -         1.60
   -       -        -         1.53
   -       -        -         1.48
   -       -        -         1.40
   -       -        -         1.38
   -       -        -         1.36
   -       -        -         1.34
   -       -        -         1.32
   -       -        -         1.30
   -       -        -         1.26
   -       -        -         1.20
  26      15      1.41        1.18
  29      14      1.46        1.15
  33      13      1.52        1.11
  36      12      1.56        1.08
  38      11      1.58        1.04
  43      10      1.63        1.00
  49       9      1.69        0.95
  58       8      1.76        0.90
  68       7      1.83        0.85
  87       6      1.94        0.78
 108       5      2.03        0.70
 154       4      2.19        0.60
 259       3      2.41        0.48
 506       2      2.70        0.30
1376       1      3.14        0.00

Table 4.12: Zipfian data for the text (German) in Mr. Honey's
Small Business Dictionary

Zipf's “principle of least effort” is not a valid connotation here, as a dictionary
mentions all the words that are required for the small business. It will thus be important
to see whether Zipf's law is applicable in this context. The hypothesis that people
minimized their efforts in using language is also not applicable here. With this framework
and preliminary analysis, we proceeded for the regression and curve fitting for this text.



S = 0.06259880
r = 0.98971406

Figure 4.17: Plot of log ranks vs. log frequency for Mr. Honey's
Small Business Dictionary (German Words)


The Linear Fit y = a+bx for the log rank and log frequency data obtained coefficients as
a = 2.02 and b = -0.63. The residual plot, which shows the difference between the data
points and the model, evaluated at the data points, is shown below.



Figure 4.18: Residual Plot for data points & model in Mr. Honey's
Small Business Dictionary (German Words)


The result verified that Zipf s law is not applicable in this text and for the Mandelbrot 
Zipf s law (g(r) = a{r 4- bf ) the coefficient c in this case is -0.66 that is not close to -I . 







Section 5: Zipf's Law in Hindi Literature ("Eidgaah" by Munshi Premchand)

For this section, we have selected IIT Kanpur's e-text of the roman version of a story called "Eidgaah" by Munshi Premchand.

This website has been built as part of a larger effort to create a series of websites based on Indian philosophical texts. It has been built under a project in the Department of Computer Science & Engineering at the Indian Institute of Technology Kanpur.

This text was chosen mainly for the following reasons:

• It is an epic work by one of the finest writers of Hindi prose, and it is written in a very simple manner, so the words used are easy to understand and commonly used.

• The roman version of the original work in Hindi was available through the pioneering work done at IIT Kanpur. It was part of a website which offers stories of Munshi Premchand in portable document format (.pdf) with a facility to translate them into many Indian languages. The roman text was obtained in this manner only.

• Dynamic fonts were used to display Indian languages. However, this did not work for us, and thus we downloaded a font for roman (DV-TTYogesh). These fonts are made by the Centre for Development of Advanced Computing (CDAC).

The following statistics were obtained for the text under consideration.


Statistic for the document

Number of words                               4951
Number of sentences                           505
Number of words per sentence                  9.80
Number of syllables per word (approximate)    1.99
Flesch index                                  28.74


The Flesch index of readability for this document is 28.74, which means it is a document aimed at the level of a college graduate. This was a departure from our present understanding that this is a very simple document, which is prescribed as essential reading in school-level books and, more so, as a chapter in some textbooks. The reason for

With this framework and preliminary analysis, we proceeded to the regression and curve fitting for this text.


S = 5.54757974
r = 0.98736665

Figure 4.19: Plot of ranks vs. frequency in Eidgaah

Hoerl Model: y = a * b^x * x^c with coefficient data a = 168.59, b = 0.98 and c = -0.37 fits this data.

S = 0.09942305
r = 0.97906317

Figure 4.20: Plot of log ranks vs. log frequency in Eidgaah


The linear fit y = a + bx for the log rank and log frequency data obtained coefficients a = 2.54 and b = -0.82. The residual plot, which shows the difference between the data points and the model, evaluated at the data points, is shown below.

Figure 4.21: Residual plot for data points & model in Eidgaah

The result verified that Zipf's law is applicable in this text, and for the Mandelbrot-Zipf law, g(r) = a(r + b)^c, the coefficient c in this case is -0.82.

The residual at point k is defined by Residual_k = y_k - f(x_k), where y_k is the measured value at x_k, and f(x_k) is the predicted value at x_k.
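The residual definition used for the residual plots can be written out directly. This is a minimal sketch: the helper names are our own, the line uses the Eidgaah coefficients a = 2.54 and b = -0.82 reported above, and the three sample points are hypothetical values chosen only to show the sign convention (measured minus predicted).

```python
def residuals(xs, ys, model):
    """Return y_k - f(x_k) for every data point (x_k, y_k)."""
    return [y - model(x) for x, y in zip(xs, ys)]


def eidgaah_line(log_rank):
    """Linear fit reported for Eidgaah: log freq = 2.54 - 0.82 * log rank."""
    return 2.54 - 0.82 * log_rank


# Hypothetical (log rank, log frequency) points, for illustration only.
sample_log_ranks = [0.0, 1.0, 2.0]
sample_log_freqs = [2.60, 1.70, 0.95]

for x, res in zip(sample_log_ranks,
                  residuals(sample_log_ranks, sample_log_freqs, eidgaah_line)):
    print(f"log rank {x:.1f}: residual {res:+.2f}")
```

A positive residual means the observed log frequency lies above the fitted line at that rank; a residual plot without visible structure indicates that the straight line is an adequate description of the data.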



Section 6: Zipf's Law in a text from Library Science Literature ("The Library", by Andrew Lang)

For this section, we have selected the Project Gutenberg e-text of "The Library", by Andrew Lang (#20 in the Project Gutenberg series by Andrew Lang, December 1999). This text was chosen mainly for the following reasons:

• It is a subject-specific work. The aim of selecting this text was to find the pattern of word usage in a text from a particular subject area.

• It will provide a comparison with the earlier analysis of Computer Science literature, i.e. it will enable us to compare the two subjects on the applicability of Zipf's law.

The following statistics were obtained for the text under consideration.


Statistic for the document

Number of words                               37498
Number of sentences                           5037
Number of words per sentence                  7.44
Number of syllables per word (approximate)    1.73
Flesch index                                  53.33

The Flesch index of readability for this document is 53.33, which means it is a document aimed at the level of a high school student. So this is a very simple document, which can be taken as elementary reading in the library science literature. One particular characteristic of this document was the occurrence of a lot of alphanumeric words and numbers. Some of these numbers were years, while some were page numbers given in the references. There were 168 different numbers that appeared in this text, and they were removed from the Zipf analysis. Numbers like 0, 1, 10, 100, 1000, 11, 12, 120, 131 and 13 appeared at various places in the text. A description of the most frequently occurring such words is given in the table on the next page.

For a given document, the Flesch readability index is an integer indicating how difficult the document is to understand, with lower numbers indicating greater difficulty.

Flesch Index = 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences)

The prominent numbers, which got more occurrences in this text, are as follows:

Word          Frequency
1             10
2             9
4             9
3             8
1830          6
5             5
6             5
10            5
1880          5
[illegible]   4
8             4
13            4

Table 4.15: Numbers & their frequency in "The Library"

Since this was a text taken from a specific subject area, there was bound to be use of context-specific words. The following is an example taken from the word "bibliography" and its related words. The frequency of occurrence of these words is also shown in the table given below.

Word                  Frequency   Meaning
Bibliographical       2           Relating to or dealing with bibliography
Bibliography          7           A list of writings with time and place of publication (such as the writings of a single author or the works referred to in preparing a document etc.)
Biblioklept           11          A person who has a compulsion to steal books
Bibliokleptomaniac    1           One who has a morbid tendency to steal books
Biblioklepts          4           Persons who have a compulsion to steal books
Bibliomania           2           Preoccupation with the acquisition and possession of books
Bibliomaniac          1           Person who has a preoccupation with the acquisition and possession of books
Bibliopegia           1           Relating to the binding of books
Bibliophile           22          Someone who loves (and usually collects) books
Bibliophiles          7           Persons who love (and usually collect) books
Bibliotheca           1           A collection of books
Bibliothec            1           A professional person trained in library science and engaged in library services [syn: librarian]

Table 4.16: Word meaning & their frequency in "The Library".
Source: WordNet ® 2.0, © 2003 Princeton University at
http://dictionary.reference.com/browse/Bibliographical


Another striking characteristic found in this document is the use of connecting and supporting words. We have analyzed such words and found that, among the top 100 most frequently occurring words, these words capture almost 50% of the total words in the document.


Connecting & Supporting Words and their Frequency

Word          Frequency     Word          Frequency     Word          Frequency
The           2746          On            205           Been          86
Of            1913          Not           203           We            83
And           1156          At            186           When          82
A             906           This          171           [illegible]   79
In            [illegible]   Have          160           Had           78
[illegible]   [illegible]   All           136           M             76
Is            540           Has           136           Them          75
His           [illegible]   One           121           Will          [illegible]
That          [illegible]   Who           121           [illegible]   71
S             [illegible]   From          116           Most          67
He            308           There         [illegible]   Its           [illegible]
Are           [illegible]   These         114           No            65
For           [illegible]   May           109           Than          65
It            299           [illegible]   106           If            59
With          279           Their         106           Many          57
As            276           De            103           First         54
But           243           Old           103           Would         54
Be            237           They          103           Can           53
Which         230           Were          103           Such          53
Was           [illegible]   So            92            Him           52
Or            224           Some          91            Even          51
By            220           More          89            Our           51
                            Mr.           89            [illegible]   48
                            Like          87            Other         48
                                                        Total         17491

Table 4.17: Most occurring words (The Library)


However, other subject-specific words like "library", "little", "printed", "years", "illustrated", "volumes", "work", "amateur", "edition", "volume", "collection", "English", "modern", "art" and "century" also occurred significantly.

Out of the 6721 distinct words, the Zipfian data for the text is obtained and is presented in the table given below.

Rank          Freq          Log Rank      Log Freq
1             2746          0.00          3.44
2             1913          0.30          3.28
3             1156          0.48          3.06
4             906           0.60          2.96
5             [illegible]   0.70          2.90
6             789           0.78          2.90
7             540           0.85          2.73
8             367           0.90          2.56
9             314           0.95          2.50
10            [illegible]   1.00          2.50
11            308           1.04          2.49
12            303           1.08          2.48
13            301           1.11          2.48
14            300           1.15          2.48
15            299           1.18          2.48
16            279           1.20          2.45
17            276           1.23          2.44
18            243           1.26          2.39
19            237           1.28          2.37
20            230           1.30          2.36
21            225           1.32          2.35
22            224           1.34          2.35
23            220           1.36          2.34
24            206           1.38          2.31
25            205           1.40          2.31
40            103
[illegible]   [illegible]
45            91
46            89
48            87
49            86
50            83
51            82
52            79
53            78
54            76
55            75
56            74
57            71
58            67
59            66
60            65
62            59
65            54
67            53
70            52
71            51
73            48
76            46
78            45
81            44
[illegible]   43
86            42
89            41
90            40
91            39
93            38
96            37
100           36            2.00          1.56
105           35            2.02          1.54
[illegible]   34            2.05          1.53
[illegible]   33
121           32
127           31
134           30
139           29
144           28
149           27
152           26
162           25
165           24
172           23
178           22
188           21
[illegible]   20
[illegible]   19
228           17
245           16
265           15
284           14
[illegible]   13
337           12
[illegible]   11
402           10
441           9             2.64          0.95
501           8             2.70          0.90
[illegible]   [illegible]   2.76          0.85
[illegible]   [illegible]   2.82          0.78
1251          3
1750          2
2809          1

Table 4.18: Zipfian data for the text (The Library)




With this framework and preliminary analysis, we proceeded to the regression and curve fitting for this text.

Modified Hoerl Model, y = a * b^(1/x) * x^c, with coefficient data a = 4769.28, b = 0.58 and c = -1.05 fits this data.

The linear fit y = a + bx for the log rank and log frequency data obtained coefficients a = 3.60 and b = -1.00. The residual plot, which shows the difference between the data points and the model, evaluated at the data points, is shown below.

Figure 4.24: Residual plot for data points & model in "The Library"

The result verified that Zipf's law is applicable in this text, and for the Mandelbrot-Zipf law, g(r) = a(r + b)^c, the coefficient c in this case is -1.00.
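As a rough check of the Modified Hoerl fit reported in this section, the model can be evaluated at the first few ranks and compared with the observed frequencies. The sketch below assumes the usual curve-fitting form of the model, y = a * b^(1/x) * x^c; the function name is our own, and the observed values are the legible top-four frequencies of "The Library".

```python
def modified_hoerl(x, a=4769.28, b=0.58, c=-1.05):
    """Modified Hoerl model y = a * b**(1/x) * x**c with the reported coefficients."""
    return a * b ** (1.0 / x) * x ** c


# Observed frequencies at ranks 1-4 of "The Library" (The, Of, And, A).
observed = {1: 2746, 2: 1913, 3: 1156, 4: 906}

for rank, freq in observed.items():
    predicted = modified_hoerl(rank)
    print(f"rank {rank}: observed {freq}, predicted {predicted:.0f}")
```

Because the coefficients are quoted to two decimal places only, the predictions agree with the observations approximately rather than exactly.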









Section 7: Zipf's Law in Urdu Literature ("Bisat-e-Hyder" by Hyder Zaheer Ansari Hyder)


For this section, we have taken an e-text from the English version of the collection of ghazals "Bisat-e-Hyder" by Hyder Zaheer Ansari Hyder (http://www.bisatehyder.indiaaccess.com/).

This text was chosen mainly for the following reasons:

• The ghazal is a genre of music or poetry which is essentially addressed to divine love. Two facets portray the ghazal: deep spirituality and passionate love. It is therefore very popular and representative of Urdu literature.

• An English translation of this text was available, so that analysis of the text was possible with Text Stat. This text is easily available on the web and is also popular. A testimony in this regard is a mail from the then President of the USA, Mr. Bill Clinton: "Thank you very much for your kind gift. I appreciate your kind thoughtfulness and generosity. You have my best wishes".

The following statistics were obtained for the text under consideration.


Statistic for the document

Number of words                               4035
Number of sentences                           529
Number of words per sentence                  7.63
Number of syllables per word (approximate)    1.60
Flesch index                                  63.63


The Flesch index of readability for this document is 63.63. This means it is a fairly easy document to understand and is of the level of the ninth standard. The prominent words, which were expected to get more occurrences in this text, were "dil", "mujhko", "mohobbat", "beat", "dared", "yadh", "gum", "hyder" and "gazal". Supporting words


For a given document, the Flesch readability index is an integer indicating how difficult the document is to understand, with lower numbers indicating greater difficulty.

Flesch Index = 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences)

S = 2.74111830
r = 0.99592837

Figure 4.25: Plot of ranks vs. frequency in Bisat-e-Hyder

Hoerl Model, y = a * b^x * x^c, with coefficient data a = 151.06, b = 0.99 and c = -0.51 fitted this data in the manner shown in the graph given above.


S = 0.10434031
r = 0.97438883

Figure 4.26: Plot of log ranks vs. log frequency in Bisat-e-Hyder

The linear fit y = a + bx for the log rank and log frequency data obtained coefficients a = 2.4 (last digit illegible) and b = -0.81. The residual plot, which shows the difference between the data points and the model, evaluated at the data points, is shown below.


Figure 4.27: Residual plot for data points & model in Bisat-e-Hyder

The result verified that Zipf's law is applicable in this text, and for the Mandelbrot-Zipf law, g(r) = a(r + b)^c, the coefficient c in this case is -0.81.

The residual at point k is defined by Residual_k = y_k - f(x_k), where y_k is the measured value at x_k, and f(x_k) is the predicted value at x_k.







Section 8: Zipf's Law in Sanskrit Literature ("Sri Vishnu Sahasranaamam")


For this section we have taken the Project Gutenberg e-book of "Sri Vishnu Sahasranaamam", by Unknown. It is in Sanskrit and the character set encoding is US-ASCII. This e-text was transcribed by N. Srinivasan and Karthik Krishnan and formatted by Maitri Venkat-Ramani. This e-text can be transliterated into Sanskrit using the ITRANS processing tool.

This text was chosen mainly for the following reasons:

• Sanskrit is one of the 22 official languages of India. According to Wikipedia, "Sanskrit is an Indo-European classical language of India and a liturgical language of Hinduism, Buddhism, and Jainism. It has a position in India and Southeast Asia similar to that of Latin and Greek in Europe, and is a central part of Hindu tradition".

• Sanskrit is mostly used as a ceremonial language in Hindu religious rituals in the form of mantras. The text taken here signifies this, as it is addressed to Lord Vishnu.

The following statistics were obtained for the text under consideration.


Statistic for the document

Number of words                               1411
Number of sentences                           283
Number of words per sentence                  4.99
Number of syllables per word (approximate)    3.33
Flesch index                                  -79.97


The Flesch index of readability for this document is -79.97. This means it is a fairly difficult document to understand. We have found 1248 distinct words. This was also expected, as the title under consideration is about synonyms of the names of Lord Vishnu. Hence the repetition of words was not expected. The repetitions which are present are basically the connecting words or explanatory words. The prominent words, which got more occurrences in this text, were as follows:

For a given document, the Flesch readability index is an integer indicating how difficult the document is to understand, with lower numbers indicating greater difficulty.

Flesch Index = 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences)

Word          Frequency
Ya            13
Cha           [illegible]
No            [illegible]
Sarva         [illegible]
Aum           [illegible]
Avyayah       4
Na            4
Naam          4
Paramam       4
Purushah      4
Vishnum       4
Vishnur       4

Table 4.20: Prominent Sanskrit words & their frequency

One can observe that supporting words like "ya", "cha", "no", "aum", "na", "yo" and "sarva" obtained a lot of occurrences. The Zipfian data for the text is obtained and is presented in the table given on the next page.


Rank          Frequency     Log Rank      Log Freq
1             13            0.00          1.11
2             8             0.30          0.90
3             7             0.48          0.85
5             4             0.70          0.60
14            3             1.15          0.48
23            2             1.36          0.30
110           1             2.04          0.00

Table 4.21: Zipfian data for the text ("Sri Vishnu Sahasranaamam")

With this framework and preliminary analysis, we proceeded to the regression and curve fitting for this text.






S = 0.51070786
r = 0.99381276

Figure 4.28: Plot of ranks vs. frequency in Sri Vishnu Sahasranaamam

Power Fit, y = a * x^b, with coefficient data a = 12.83 and b = -0.61 fitted this data in the manner shown in the graph given above.

S = 0.05456221
r = 0.99147504

Figure 4.29: Plot of log ranks vs. log frequency in Sri Vishnu Sahasranaamam




The linear fit y = a + bx for the log rank and log frequency data obtained coefficients a = 1.07 and b = -0.54. The residual plot, which shows the difference between the data points and the model, evaluated at the data points, is shown below.

The residual at point k is defined by Residual_k = y_k - f(x_k), where y_k is the measured value at x_k, and f(x_k) is the predicted value at x_k.

Figure 4.30: Residual plot for data points & model in Sri Vishnu Sahasranaamam

The result verified that Zipf's law is not applicable in this text. For the Mandelbrot-Zipf law, g(r) = a(r + b)^c, the coefficient c in this case is -0.54 (which is not close to unity in magnitude).
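The link between the linear fit on the log-log data and the Mandelbrot-Zipf coefficient c can be illustrated as follows. This is a sketch under stated assumptions: the function names are our own, the shift b = 0 is assumed so that the law reduces to a pure power law, a is taken as 10^1.07 from the intercept of the Sanskrit fit, and b = 2 in the second call is a purely hypothetical value.

```python
import math


def mandelbrot_zipf(rank, a, b, c):
    """Mandelbrot-Zipf law g(r) = a * (r + b) ** c."""
    return a * (rank + b) ** c


def loglog_slope(rank_1, rank_2, a, b, c):
    """Slope of the line joining two points of g(r) on log-log axes."""
    dy = math.log10(mandelbrot_zipf(rank_2, a, b, c)) - math.log10(
        mandelbrot_zipf(rank_1, a, b, c)
    )
    dx = math.log10(rank_2) - math.log10(rank_1)
    return dy / dx


a = 10 ** 1.07  # intercept of the Sanskrit linear fit, back-transformed
c = -0.54       # slope of the Sanskrit linear fit

# With b = 0 the log-log slope equals c at every pair of ranks.
print(loglog_slope(1, 100, a, 0.0, c))

# With a positive shift (hypothetical b = 2) the curve flattens at low ranks.
print(loglog_slope(1, 2, a, 2.0, c))
```

This is why the slope of the linear fit is read as the coefficient c in each section: under the b = 0 assumption the two quantities coincide.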



Section 9: Zipf's Law and Flesch Readability Index

Zipf (1949), in his work "Human Behavior and the Principle of Least Effort", viewed language as a "tool" that is shaped by its "jobs" in human society. Other works of Zipf were "Selected Studies of the Principle of Relative Frequency in Language", which was published in 1932, and "The Psycho-Biology of Language", which was published in 1935.

Many years after his death, linguists agreed that speakers simplify communication by using a small pool of words that they can retrieve quickly from their memory, and that listeners simplify communication by preferring words with a single and unambiguous meaning. This proved that Zipf's law is applicable in understanding human language.

Zipf searched for a principle of least effort that would explain the equilibrium between uniformity and diversity in the usage of words. Most others searched for a probabilistic explanation. The burning question still remains: do we have any new evidence that Zipf's explanation by the principle of least effort is more correct than a statistical explanation?

The Flesch Readability Index, on the other hand, has become a sort of standard as far as the readability of documents is concerned. In many places, it has become imperative to ascertain that documents and forms have a fairly high value of the Flesch Readability Index, so that they are understood by the masses.

In this section, we have tried to investigate whether there is any relation between Zipf's principle of least effort and the readability of the document.


Zipf's Law

Zipf formulated a law in 1930 that says the frequency count (number of occurrences) of a word in any text is inversely proportional to the rank of that word. In other words, the distribution of words adheres to a regular statistical pattern, or "The probability of occurrence of words or other items starts high and tapers off exponentially. Thus, a few occur very often while many others occur rarely" (Black, 2000).
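The rank-frequency relation can be checked on any text by counting words, ranking them by frequency and looking at the product of frequency and rank. The sketch below is illustrative only: the tokenization rule (lower-cased runs of letters and apostrophes) and the tie-breaking rule (alphabetical order among equal frequencies, each word receiving its own consecutive rank) are our assumptions, not necessarily the procedure used to build the tables in this chapter.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return rows (rank, word, frequency, frequency * rank), most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, freq * rank)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


sample = "the cat and the dog and the bird"
for rank, word, freq, product in rank_frequency(sample):
    print(rank, word, freq, product)
```

If Zipf's law held exactly, the last column would be the same constant for every row; in real texts it is only roughly stable, which is why the fit is assessed by regression on the log-log data.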

To further explain the basic form of the law, frequency and rank have an inversely proportional relationship:

frequency * rank = constant, or f * r = c

Zipf attributed this law to the "Principle of Least Effort". The Principle of Least Effort postulates that a person would like to communicate in such a way as to minimize his total effort. Altmann (2002) commented that Zipf's ideas are the foundation stones of modern quantitative linguistics and that his influence is not restricted to linguistics but incessantly penetrates other sciences. Mandelbrot (1953) discussed Zipf's law in terms of communication costs and explained that communication costs increase as the number of words and their length grow. Ferrer-i-Cancho & Sole (2001a) commented that many models of syntactic communication assume this law; it is an obvious ingredient for any theory of language evolution. According to Li (2002), the number of times a word is used in written human languages and the frequency of usage are the variables that follow a Zipf-type distribution. Smith & Devine (1985) found that legal texts also follow Zipf's law, but in a slightly different manner. Francis & Kucera (1964) applied Zipf's law to the Brown corpus of 1 million words of American English. Le Quan et al. (2002) analyzed Zipf's law for large corpora in two languages, English (from the Wall Street Journal) and Mandarin (from the People's Daily newspaper and the Xinhua News Agency). Wang (1989) presented the Zipf distribution of a Chinese corpus, and Wyllys (1981) took a data set of 3907 English words. Sun et al. (1999) commented, "Studies of word frequency have many interesting and potentially significant applications. For example this model could be used to evaluate a single article or an author's work. Assuming a reasonable level of skill among the writers whose works are the basis for our observations, we can use this model as a benchmark for assessing writer's language skills". Gelbukh & Sidorov (2001) observed that the coefficients of Zipf's law are different for different languages. Ferrer-i-Cancho & Sole (2001b) showed that the co-occurrence of words in sentences relies on the network structure of the lexicon. They analyzed the properties in depth and commented that human language can be described in terms of a graph of word interactions.

Flesch Readability Index

For a given document, the Flesch readability index is an integer indicating how difficult the document is to understand, with lower numbers indicating greater difficulty.

Flesch Index = 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences)

According to Wikipedia, the free encyclopedia, a syllable is a unit of organization for a sequence of speech sounds. Syllables are often considered the phonological "building blocks" of words. They can influence the rhythm of a language.

The Flesch readability index can be related to the educational level of the audience. For example, a document with a score of 91-100 can be easily comprehended by a 5th grade student, a score of 51-60 is understandable by a high school student, a college graduate will be able to comprehend a document with a score of 31-50, and a document with a score of less than 0 can be understood by a law school graduate only.
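The index can be computed directly from the document statistics reported in each section. The following is a minimal sketch; the function names are our own, and the syllable count is taken as given because syllable counting itself is only approximate in this study.

```python
def flesch_index(words, sentences, syllables):
    """Flesch index from raw counts of words, sentences and syllables."""
    return 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences)


def flesch_from_ratios(syllables_per_word, words_per_sentence):
    """Flesch index from the two ratios listed in the statistics tables."""
    return 206.835 - 84.6 * syllables_per_word - 1.015 * words_per_sentence


# Ratios reported for "Sri Vishnu Sahasranaamam": 3.33 syllables per word
# and 4.99 words per sentence; the reported index is -79.97.
print(round(flesch_from_ratios(3.33, 4.99), 2))
```

Since the tables list the ratios rounded to two decimals, the recomputed index agrees with the reported value only to within rounding.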

Research Question

If a document has a high Flesch Readability Index, will Zipf's curve fit this document better? In other words, if a document is fairly easy to understand, will it follow the Zipfian distribution? Is Zipf's law applicable in understanding human language? Can it be used as a benchmark for assessing a writer's skill?

Findings

Appendix I lists the documents with related statistics: number of words in the document, Flesch Readability Index, Zipf's coefficient, number of sentences, number of syllables per word and number of words per sentence. On the basis of this, we tried to group documents primarily with respect to the values of the Flesch Readability Index and Zipf's coefficient.

Type I documents were those with a very high negative Flesch Readability Index and also a poor Zipf's coefficient. This was understandable, as the first one was a group of German words taken from an English-German business dictionary. The second one was "slokas" from the Sanskrit language, which follows a very different style. These documents thus possessed very few words per sentence and more syllables per word, resulting in highly negative values of the Flesch Readability Index.

The Type II document was the English words taken from the English-German business dictionary. We expected that with improvement in the Flesch Readability Index, Zipf's coefficient would also improve, but that did not happen. The Type III document was a classic story from Hindi literature. Although it had a low Flesch Readability Index, it had an improved Zipf's coefficient compared to the Type I and II documents.

Type IV documents had an excellent value of Zipf's coefficient and also good readability (understandable by a high school level reader). This tends to show that Zipf's law is applicable in documents that have on average 1.6 syllables per word and 5-8 words per sentence.

Type V documents, however, nullified the claim that was found in Type IV documents. Almost all these documents have Zipf's coefficients ranging from -1.20 to -1.37, but have variable readability indexes ranging from 46 to 80. No trend has been found in the syllables per word or the words per sentence either.

Figure 4.31 shows the relation between Zipf's coefficients and the Flesch Readability Index. One can easily visualize the types of documents here. Can we conclude that if a document has a highly negative readability index it is bound to have a bad Zipfian fit? The curve that fitted this type of distribution is the Sinusoidal Fit: y = a + b*cos(cx + d) with coefficient data a = -0.83, b = 0.40, c = 0.03 and d = 1.31, with standard error = 0.16 and correlation coefficient = 0.79.


Figure 4.31: Relation between Zipf's coefficients and Flesch Readability Index (x-axis: Flesch Index).

Conclusion

Although we have tried to make the sample as diverse as possible, we finally found that 21 documents belong to Type V. These documents have different readability indices, belong to different genres and belong to different time periods, but have almost similar values of Zipf's coefficient. This indicates that readability has little to do with Zipf's coefficients.

This led us back to our research question: if a document has a high Flesch Readability Index, will Zipf's curve fit this document better? Type IV documents partially demonstrate this, as they have an excellent value of Zipf's coefficient and also good readability. Type V documents also have Zipf's coefficients that are not too bad, but these coefficients are nearly constant while readability varies from 46 to 80. Type I documents, however, showed that a poor Zipf's coefficient goes with a highly negative Flesch Readability Index.

Coming to the next research question: is Zipf's law applicable in understanding human language? Can it be used as a benchmark for assessing a writer's skill? The findings in this communication refute this claim. This is because Zipf's principle of least effort says that a writer simplifies communication by using a small pool of words from memory. This would mean that these communications ought to have good readability indices too. So, all those documents that have good readability indices should have good Zipf's coefficients also. This, however, is not reflected in the findings. Hence the claim that Zipf's law is applicable in understanding human language is contradicted. This also contradicts the comment of Sun et al. (1999) that "we can use this model as a benchmark for assessing writer's language skills".

In conclusion, we can say that probably more data sets need to be taken to formalize these findings.

Document Statistics and Zipf's Law

File Name                           Words     Flesch Index  Zipf's Coefficient  No. of Sentences  Syllables per Word  Words per Sentence
Eng-ger-busDictionary.txt           9107      -82.83        -0.63               5792              3.41                1.57
sanskritwork.txt                    1411      -79.97        -0.54               283               3.33                4.99
Eng-ger-busDictionary.txt           10089     -9.95         -0.66               5763              2.54                1.75
eidgaah.txt                         4951      28.74         -0.82               505               1.99                9.8
librarys.txt                        37498     53.33         -1                  5037              1.73                7.44
aladdin ger.txt                     7686      57.9          -0.92               1536              1.7                 5
aladdin eng.txt                     5319      68.?3         -0.92               661               1.54                8.05
urdu.txt                            4035      63.63         -0.81               529               1.6                 7.63
jefferson-autobiography-73.txt      40648     48.76         -1.37               5326              1.78                7.63
wollstonecraft-maria-196.txt        45874     52.75         -1.37               5426              1.72                8.45
franklin-autobiography-244.txt      68157     57.27         -1.37               7270              1.66                9.38
chaucer-canterbury-102.txt          99403     69.18         -1.35               13578             1.54                7.32
augustine-confessions-276.txt       176014    69.99         -1.35               22974             1.53                7.66
mill-subjection-217.txt             45240     46.75         -1.33               5108              1.79                8.86
Arabian nights entertainments.txt   90768     62.41         -1.33               10672             1.61                8.51
aristotle-meteorology-80.txt        43470     60.99         -1.3                5030              1.62                8.64
freud-young-763.txt                 72133     74.4          -1.3                11160             1.49                6.46
berkeley-treatise-177.txt           36342     52.17         -1.29               4115              1.72                8.83
locke-concerning-[illegible].txt    53786     56.56         -1.29               5732              1.66                9.38
barrie-peter-277.txt                47885     71.56         -1.29               6906              1.52                6.93
bunyan-pilgrims-304.txt             57122     73.73         -1.28               7241              1.48                7.89
anonymous-beowulf-543.txt           27129     71.35         -1.27               4173              1.52                6.5
dickens-christmas-125.txt           21818     67.75         -1.25               3301              1.56                6.61
hiroshima nagasaki.txt              25341     46.75         -1.24               3313              1.8                 7.65
twain-tom-40.txt                    24486     80.99         -1.24               3564              1.41                6.87
lucretius-on-398.txt                75386     65.78         -1.23               10549             1.58                7.15
keats-endymion-484.txt              31962     68.19         -1.23               4847              1.56                6.59
365 foriegn dishes.txt              27891     72.03         -1.22               4424              1.52                6.3
The arctic queen.txt                16703     62.09         -1.21               2451              1.63                6.81
shakespeare-hamlet-25.txt           33098     70.42         -1.2                4931              1.53                6.71
shakespeare-romeo-48.txt            26784     71.97         -1.15               3854              1.51                6.95


Section 10: Zipf.'- l-au and Principle of Least effort 


Zipf attributed iiis law as a consequence of “Principle of Least Effort". The Principle of 
Least Elfort postulates tliat a person would like to communicate in such a way as to 
minimize his total ellort. in other words, a person will tend to “minimize” the probable 
average of his work-e.xpenditure (over time), meaning use of least amount of work. 
Principle ol Least Ellbii is relevant even today. However, it was criticized by Rapaport^* 
(1957) on tile basis that although Zipfs arguments are plausible in a great variety of 
situations, thes' are not suitable for generalizations. 


Zipl^ (1949) in his work. “Munian Behavior and the principle of least effort” viewed 


language as a "tool" that is shaped by its "jobs" in human society. The purpose of this 
book, which was an introduction to human ecology, “is to establish the Principle of least 
effort as the primary principle that governs our entire individual and collective behavior 


of all .sorts”. 


Chai Kim (1982) investigated the extent to which the principle of least effort as
advanced by Zipf provided a theoretical basis for identifying and updating descriptors of
science/technology and social sciences. He found that "the relative frequency of
occurrence of the descriptors of social sciences conformed to the theoretical distribution
of Zipf while that of science/technology did not".


In this section we will try to view the above facts on the basis of the 31 documents that
we have analyzed for checking the robustness of Zipf's law. We have thus collected data
on the following parameters:


• Words: This is the total number of words in a document.

• Sum of frequency: Sum of frequencies is the sum of rank frequencies of the
Zipfian data of the document. When shown as a percentage of the total words it
reveals the contain% of the document.

• Unique words: This is the count of unique words in a document.

• Least effort %: This is the ratio (percentage) of the unique words to the total
number of words in the data. It reveals the "effort" that the writer has made in
communicating his ideas. The smaller the percentage, the less is the effort of the
writer.

• Contain%: This is the ratio (percentage) of the sum of rank frequencies of the
Zipfian data of the document to the total number of words in the data. It reveals
the amount by which the Zipfian data is able to capture the document. The higher
the percentage, the better the containment of the document in the Zipfian
data.

• Flesch Index: For a given document, the Flesch readability index is an integer
indicating how difficult the document is to understand, with lower numbers
indicating greater difficulty.

$\text{Flesch Index} = 206.835 - 84.6 \times \frac{\text{syllables}}{\text{words}} - 1.015 \times \frac{\text{words}}{\text{sentences}}$

• Zipf's Coefficient: It refers to Mandelbrot's generalization of Zipf's law that the
slope is -1. Mandelbrot (1952, 1964) assumed that the aim of language is to
transmit the most information per symbol with the least effort. It is expressed by
the relationship $f(r) = K (r + c)^{-\theta}$ where $f(r)$ is the rank frequency and $r$ is the
rank of the word; 'c' and 'θ' are constants. 'c' improves the fit for small r and the
exponent 'θ' improves the fit for large r. Here the Zipf's coefficient refers to the
exponent 'θ'.
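The parameter definitions above can be illustrated with a minimal Python sketch. It is not the procedure used in this work: the tokenizer (a simple letter-run regular expression), the function name `document_parameters`, and the choice to pass syllable and sentence counts in as arguments are all assumptions made for illustration, and the Flesch constants are the standard Reading Ease values.

```python
import re


def document_parameters(text, syllables, sentences):
    """Return (total words, unique words, least effort %, Flesch index).

    `syllables` and `sentences` are supplied by the caller; counting them
    reliably is outside the scope of this sketch.
    """
    # Assumed tokenizer: lower-cased runs of letters count as words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words)
    unique = len(set(words))
    # Least effort %: unique words as a percentage of all words.
    least_effort = 100.0 * unique / total
    # Flesch Reading Ease with the standard constants.
    flesch = 206.835 - 84.6 * (syllables / total) - 1.015 * (total / sentences)
    return total, unique, least_effort, flesch


if __name__ == "__main__":
    sample = "The cat saw the dog."
    print(document_parameters(sample, syllables=5, sentences=1))
```

A lower least effort % means that fewer distinct words carry the text, which is how the section reads the writer's "effort".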


Now with this data in hand, we tried to apply factor analysis. Factor analysis attempts to 
identify underlying variables, or factors, that explain the pattern of correlations within a 
set of observed variables. Factor analysis is often used in data reduction, by identifying a 
small number of factors, which explain most of the variance observed in a much larger 
number of manifest variables.










File Name                          No. of Words   Sum (freq)    Unique words   Least effort %   Contain%      Flesch Index
augustine-confessions-276.txt      176014         [illegible]   9629           5.47             66.94         69.99
365 foriegn dishes.txt             27891          [illegible]   1581           5.67             59.47         72.03
freud-young-763.txt                72133          47288         4486           6.22             65.56         74.4
Arabian nights entertainments.txt  90768          55204         6464           7.12             60.82         62.41
aristotle-meteorology-80.txt       43470          27171         3186           7.33             62.51         60.99
bunyan-pilgrims-304.txt            57122          36420         4274           7.48             63.76         [illegible]
librarys.txt                       37498          18037         2809           7.49             48.10         53.33
locke-concerning-111.txt           53786          34675         4169           7.75             64.47         56.56
sanskritwork.txt                   1411           [illegible]   [illegible]    8.45             [illegible]   79.97
berkeley-treatise-177.txt          36342          [illegible]   [illegible]    8.87             60.15         52.17
chaucer-canterbury-102.txt         99403          [illegible]   8854           8.91             60.25         69.18
aladdin ger.txt                    17686          [illegible]   1633           9.23             40.90         57.9
franklin-autobiography-244.txt     68157          [illegible]   6496           [illegible]      [illegible]   57.27
aladdin eng.txt                    5319           2521          524            [illegible]      [illegible]   68.23
lucretius-on-395.txt               75386          41078         7446           [illegible]      [illegible]   65.78
barrie-peter-277.txt               47885          28498         4788           10.00            59.51         71.56
twain-tom-40.txt                   24486          14493         2455           10.03            59.19         80.99
urdu.txt                           4035           1280          424            10.51            31.72         63.63
mill-subjection-217.txt            45240          26259         4885           10.80            58.04         46.75
wollstonecraft-maria-196.txt       45874          25319         5940           12.95            55.19         52.75
shakespeare-romeo-48.txt           26784          [illegible]   3541           13.22            53.43         71.97
jefferson-autobiography-73.txt     40648          21556         5497           13.52            53.03         48.76
hiroshima nagasaki.txt             25341          [illegible]   3448           13.61            48.02         46.75
shakespeare-hamlet-25.txt          33098          [illegible]   4542           13.72            55.18         70.42
anonymous-beowulf-543.txt          27129          12315         3744           13.80            45.39         71.35
Eng-ger-busDictionary.txt          10089          1690          1402           13.90            16.75         -9.95
Eng-ger-busDictionary.txt          9107           645           1378           15.13            7.08          -82.83
dickens-christmas-125.txt          21818          10723         3695           16.94            49.15         67.75
keats-endymion-484.txt             31962          14182         5521           17.27            44.37         68.19
The arctic queen.txt               16703          6776          3482           20.85            40.57         62.09
eidgaah.txt                        4951           1698          1497           30.24            34.30         28.74

Table 4.23: Least effort percentage and contain percentage of the documents

The following descriptive statistics were obtained when we proceeded for the factor
analysis with the four variables: least effort % (LE_PER), contain% (DEF_PER), Flesch
Index (F_INDEX) and Zipf's Coefficient (ZIPF_C).


Variable    Mean       Std. Deviation
DEF_PER     49.3219    16.2034
F_INDEX     50.7394    39.0706
LE_PER      14.0561    14.6947
ZIPF_C      -1.1535    .2388

Table 4.24: Descriptive Statistics of Variables

The correlation matrix obtained suggests that there is a strong negative correlation
between Contain% and Zipf's coefficient; on the other hand, there is a strong positive
correlation between Contain% and the Flesch readability index. There is a weak
correlation between least effort % and Zipf's coefficient.



            DEF_PER    F_INDEX    LE_PER     ZIPF_C
DEF_PER     1.000      .849       -.661      -.907
F_INDEX     .849       1.000      -.667      -.752
LE_PER      -.661      -.667      1.000      .556
ZIPF_C      -.907      -.752      .556       1.000

Table 4.25: Correlation matrix of variables

Bartlett's test of sphericity tests whether the correlation matrix is an identity matrix,
which would indicate that the factor model is inappropriate. So the null hypothesis that
the correlation matrix is an identity matrix is tested and the following results were
obtained ($\chi^2 = 102.21$, df = 6, p = 0.00). The p-value found here is significant, hence we
reject the null hypothesis that the inter-correlation matrix comes from a population in
which the variables are non-collinear (i.e. an identity matrix). The Kaiser-Meyer-Olkin
(KMO) measure of sampling adequacy tests whether the partial correlations among
variables are small.
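As a cross-check on the reported statistic, Bartlett's test can be recomputed from the correlation matrix alone. The sketch below uses the usual textbook form of the statistic, $\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln|R|$ with $p(p-1)/2$ degrees of freedom. The function name and the use of NumPy are assumptions, and because the correlations are entered as rounded to three decimals the result only approximates the reported value.

```python
import math

import numpy as np


def bartlett_sphericity(corr, n):
    """Bartlett's test of sphericity from a correlation matrix.

    corr : p x p correlation matrix
    n    : number of observations (documents)
    Returns (chi-square statistic, degrees of freedom).
    """
    corr = np.asarray(corr, dtype=float)
    p = corr.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * math.log(np.linalg.det(corr))
    df = p * (p - 1) // 2
    return chi2, df


# Correlations as reported (order: DEF_PER, F_INDEX, LE_PER, ZIPF_C).
R = [
    [1.000, 0.849, -0.661, -0.907],
    [0.849, 1.000, -0.667, -0.752],
    [-0.661, -0.667, 1.000, 0.556],
    [-0.907, -0.752, 0.556, 1.000],
]

if __name__ == "__main__":
    chi2, df = bartlett_sphericity(R, n=31)
    print(f"chi-square = {chi2:.2f}, df = {df}")
```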

The KMO statistic is 0.769, which means we can conclude that the degree of common
variance among the variables is all right and the factors extracted will account for a fair
amount of variance. We have also calculated communalities; a communality is the
proportion of the total variance of a variable accounted for by the common factors in a
factor analysis. All the variables have communality above 0.95, with LE_PER having the
highest communality (1.00) and DEF_PER having the lowest (0.955). The extraction
method was based on Principal Component Analysis.


Total Variance Explained

             Initial Eigenvalues                             Extraction Sums of Squared Loadings
Component    Total        % of Variance   Cumulative %      Total     % of Variance   Cumulative %
1            3.209        80.232          80.232            3.209     80.232          80.232
2            .491         12.274          92.506            .491      12.274          92.506
3            .229         5.737           98.243            .229      5.737           98.243
4            7.028E-02    1.757           100.000

Extraction Method: Principal Component Analysis

Table 4.26: Components and % of variance they explain in Factor Analysis

The following factors were obtained:


Component Matrix

             Component
             1         2            3
DEF_PER      .961      .171         -5.171E-02
F_INDEX      .916      1.269E-02    .396
ZIPF_C       -.905     -.326        .230
LE_PER       -.792     .596         .132

Extraction Method: Principal Component Analysis.

Table 4.27: Factors obtained by Principal Component Analysis
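The eigenvalues and loadings of a principal component extraction can be recovered from the correlation matrix by an eigendecomposition. The following sketch assumes NumPy and the correlations as reported to three decimals, so it is a reconstruction under those assumptions and not the original statistical package output. The sign of each loading column is arbitrary and may be flipped relative to the component matrix.

```python
import numpy as np

# Correlations as reported (order: DEF_PER, F_INDEX, LE_PER, ZIPF_C).
R = np.array([
    [1.000, 0.849, -0.661, -0.907],
    [0.849, 1.000, -0.667, -0.752],
    [-0.661, -0.667, 1.000, 0.556],
    [-0.907, -0.752, 0.556, 1.000],
])

# eigh is appropriate for symmetric matrices; it returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Share of total variance per component (the trace of R equals 4).
explained = 100.0 * eigvals / eigvals.sum()

# Component loadings: eigenvectors scaled by the square root of the eigenvalue.
loadings = eigvecs * np.sqrt(eigvals)

if __name__ == "__main__":
    for k, (ev, pct) in enumerate(zip(eigvals, explained), start=1):
        print(f"component {k}: eigenvalue {ev:.3f}, {pct:.3f}% of variance")
    print(np.round(loadings, 3))
```

Squaring and summing a variable's loadings across the retained components gives its communality.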

Factor 1 (DEF_PER & F_INDEX) has 2 variables. Factor 2 (LE_PER) has one variable
and factor 3 (ZIPF_C) has 1 variable. The first two factors are explaining almost 92% of
the total variance. The following Scree plot has been obtained:

[Figure 4.32: Scree plot of factors; x-axis: Component Number]

Now we are able to comment that contain% and the Flesch [...]
with the least effort % are explaining 92% of the variance [...]
coefficient is a redundant data here. To probe more into this [...]
predictor variables in a multiple regression analysis to predict
Zipf's coefficient.


Model Summary (cell values not legible in the source)

Predictors: (Constant), DEF_PER
Dependent Variable: ZIPF_C

ANOVA

Model 1       Sum of Squares   df    Mean Square   F          Sig.
Regression    1.407            1     1.407         134.449    .000
Residual      .303             29    [illegible]
Total         [illegible]      30

Coefficients

             Unstandardized Coefficients   Standardized Coefficients                        Collinearity Statistics
Model 1      B             Std. Error      Beta                        t             Sig.   Tolerance   VIF
(Constant)   -.496         .060                                        [illegible]   .000
DEF_PER      -1.336E-02    .001            -.907                       -11.59        .000   1.000       1.000

Excluded Variables

                                                                 Collinearity Statistics
Model 1    Beta In   t       Sig.   Partial Correlation   Tolerance   VIF      Minimum Tolerance
LE_PER     -.078     -.744   .463   -.139                 .563        1.778    .563
F_INDEX    .063      .422    .676   [illegible]           .280        3.572    .280

Predictors in the Model: (Constant), DEF_PER
Dependent Variable: ZIPF_C

Table 4.29: Multiple Regression Analysis (ANOVA Table)

The model found in the analysis suggests that DEF_PER alone explains 82.3% of the
variance of Zipf's coefficient. Adjusted R² for the model is 0.816. R² is adjusted to
reflect the model's goodness of fit for the population. The net effect of this adjustment
is to reduce R² from 0.823 to 0.816, thereby making it comparable to other R²'s in case
other models are also found.

Standard error of the model is 0.1023. This is the standard deviation of actual values of Y
about the estimated Y values. Analysis of Variance measures whether or not the equation
represents a set of regression coefficients that, in total, are statistically significant from
zero. The value of F is found to be 134.44, which is significant at less than 0.05
level of significance at 1, 29 df. Regression coefficients for the model and the
unstandardized regression coefficients for the [...] are given in the table. The
equation may be constructed as

ZIPF_C = -0.496 - 0.01336 DEF_PER

Or, in standardized form,

ZIPF_C = -0.907 DEF_PER

Since the variance inflation factor (VIF) is 1.0, multicollinearity is not a problem. The
[...] the variable DEF_PER is the only [...] and the variable LE_PER or least
effort is not required.
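For a regression with a single predictor, the quantities discussed above follow directly from the correlation coefficient and the descriptive statistics. The sketch below is a hedged worked example. It assumes that the mean 49.3219 and standard deviation 16.2034 in the descriptive statistics belong to DEF_PER (the row labels are partly illegible in the source), and it takes r = -0.907 and n = 31 from the text. Variable names are illustrative.

```python
# Simple linear regression of ZIPF_C on DEF_PER from summary statistics only.
n = 31                               # number of documents
r = -0.907                           # correlation between DEF_PER and ZIPF_C
mean_x, sd_x = 49.3219, 16.2034      # DEF_PER (assumed row assignment)
mean_y, sd_y = -1.1535, 0.2388       # ZIPF_C

# Unstandardized slope and intercept.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Goodness of fit for one predictor.
r_squared = r ** 2
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)

# F statistic with (1, n - 2) degrees of freedom.
f_stat = r_squared / (1 - r_squared) * (n - 2)

if __name__ == "__main__":
    print(f"ZIPF_C = {intercept:.3f} + ({slope:.5f}) * DEF_PER")
    print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}, F = {f_stat:.2f}")
```

Under these assumptions the computed values land close to the reported coefficients, R² and F, with small differences caused by rounding of the inputs.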


The result thus came close to the finding of Rapaport (1957) that although Zipf's
arguments are plausible in a great variety of situations, they are not suitable for
generalizations. It also supported Chai Kim (1982) that "the relative frequency of
occurrence of the descriptors of social sciences conformed to the theoretical distribution
of Zipf while that of science/technology did not". In this data, documents like the English
German business dictionary, Sanskrit text and Urdu text behave differently than the
Zipfian distribution.
Discussion


According to George Bernard Shaw, Irish literary critic and essayist who won the
Nobel Prize for Literature in 1925,

"New opinions often appear first as jokes and fancies, then as
blasphemies and treason, then as questions open to discussion and
finally as established truths".

When we started working on this problem, the work looked like a mammoth task.
There are several reasons for this. First and foremost was the popularity of Zipf's law
and its applicability in a plethora of activities. Secondly, there were a gigantic number
of exceptionally eminent people who have worked on works of Zipf.

We started by setting very modest goals for us. These objectives are as follows:

• To find the interrelationships between the rank and the frequency of a word in
selected literatures.

• To test whether the Zipf's law can be applied in these literatures.

• To do an inter-literature comparison of the applicability of Zipf's law.

• To do mathematical modeling & validation of the model through the collected data.

And we set the following hypotheses:

1. The principle of least effort is a universal phenomenon.

2. All writers would follow an economy in the use of words irrespective of the
languages concerned.

3. The rank-frequency distribution of words would be similar in all languages.

Let us discuss the analysis of the documents that we have collected for this work. For
this task we have divided the discussion in three sections that address the
objectives mentioned above.




Interrelationships between the rank and the frequency


This section is devoted to the first objective, which was to find the interrelationships
between the rank and the frequency of a word in selected literatures. To probe into this,
the documents were subjected to curve fitting. The Zipf's coefficient was obtained by
fitting the Zipf-Mandelbrot equation or the shifted power curve. Simultaneously, rank and
frequency data was also subjected to curve fitting and we obtained the following major
categories of distributions that were able to describe the rank-frequency distribution of
the documents.

• Yield density family - Bleasdale and Harris Models

• Power Law family - Hoerl, Modified Hoerl, Power models

• Exponential family - Vapour Pressure models

• Sigmoidal family - MMF & Weibull models


As we can see, four model families were used and all were able to explain around 97% -
99% of the variance. The big question which automatically arises is: how to compare
the appropriateness of the different types of functions fitted to this data? Should one just
fit all the commonly used functions and see which one fits the data "the best"? A good
analysis requires robust techniques in assessing and empirically developing the model.
The data is never wrong and thus "Statistics" should "speak" for the data. One should not
be led by assumption but should try empirical evidence. The data itself suggests how
& in what form the model is to be used. In summary, we have not adhered to
pre-specifying the model but tried to develop the model by keeping it simple and
parsimonious. Nobody can claim that a particular model is the true equation of the data in
question, as the true equation is only known to GOD, and hence it is said that "All models [...]


Following is the classification of documents according to the distribution [...]
gave good parameter values and fit statistics, but nobody was able to [...]
generalized rule or procedure.

The yield-density models are widely used to model the relationship between [...]
crop and the spacing or density of planting. If the response is such [...]
increases, but the yield (y) approaches a fixed value, the relationship is asymptotic. If the
response is such that there is a distinct optimum as the density increases, the relationship
is parabolic.

The documents like Aladdin (English), Aladdin (German), and English-German Business
Dictionary (English & German words) were the four documents that showed behavior
like this family. Following are the fit parameters for this type.

Bleasdale Model, $y = (a + bx)^{-1/c}$

File Name          No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a       b       c
aladdin eng.txt    5319           524            -0.92           Bleasdale Model                         0.15    0.01    0.36
aladdin ger.txt    17686          1633           -0.92           Bleasdale Model                         0.06    0.00    0.47


Harris Model, $y = \dfrac{1}{a + bx^{c}}$

File Name                    No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a        b       c
Eng-ger-busDictionary.txt    [illegible]    1402           -0.66           Harris Model                            0.12     0.12    0.09
Eng-ger-busDictionary.txt    [illegible]    1378           -0.63           Harris Model                            -0.01    0.02    0.43

The second class of distribution was the Sigmoidal Family. These "S-shaped" growth curves
are common in a wide variety of applications such as biology, engineering, agriculture and
economics. These curves start at a fixed point and increase their growth rate
monotonically to reach an inflection point. After this, the growth rate approaches a final
value asymptotically. This family is actually a subset of the Growth Family, but is
separated because of its distinctive behavior. Many documents were found to adhere to
this family of distributions. As many as 16 documents (out of 31) were found to come
from this family. Fit parameters for these documents are given below:


Weibull Model, $y = a - b\,e^{-c x^{d}}$

File Name                         No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a           b           c       d
jefferson-autobiography-73.txt    40648          5497           -1.37           Weibull Model                           4519.53     4534.42     1.23    -1.04
wollstonecraft-maria-196.txt      45874          5940           -1.37           Weibull Model                           2551.46     2592.51     3.08    -1.02
augustine-confessions-276.txt     176014         9629           -1.35           Weibull Model                           10151.81    10324.38    1.76    -0.81
berkeley-treatise-177.txt         36342          7222           -1.29           Weibull Model                           2111.22     2180.72     2.16    -0.83
locke-concerning-111.txt          53786          4169           -1.29           Weibull Model                           3967.59     4007.99     1.80    -0.92
bunyan-pilgrims-304.txt           57122          4274           -1.28           Weibull Model                           2487.01     3081.75     2.13    -0.80
anonymous-beowulf-543.txt         27129          3744           -1.27           Weibull Model                           7606.39     7633.55     0.28    -0.84
keats-endymion-484.txt            31962          5521           -1.23           Weibull Model                           1318.18     1341.75     2.81    -0.95
The arctic queen.txt              16703          3482           -1.21           Weibull Model                           3311.38     3345.22     0.32    -0.72
shakespeare-romeo-48.txt          26784          3541           -1.15           Weibull Model                           830.34      940.91      2.77    -0.63
freud-young-763.txt               72133          4486           -1.30           Weibull Model                           2381.06     2467.13     5.22    -0.92

Table 5.3: Documents where the rank & frequency relationship followed Weibull model



MMF Model, $y = \dfrac{ab + c x^{d}}{b + x^{d}}$

File Name                            No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a           b       c          d
franklin-autobiography-244.txt       68157          6496           -1.37           MMF Model                               4301.03     3.06    -43.47     1.10
chaucer-canterbury-102.txt           99403          8854           -1.35           MMF Model                               5528.24     3.08    -106.65    0.95
Arabian nights entertainments.txt    90768          6464           -1.33           MMF Model                               77313.18    0.09    -122.78    0.74
dickens-christmas-125.txt            21818          3695           -1.25           MMF Model                               4653.87     0.36    -68.54     0.68
shakespeare-hamlet-25.txt            33098          4542           -1.20           MMF Model                               1674.85     2.35    -93.73     0.75

The next family was the exponential models. These models have the exponential or
logarithmic functions involved. They are generally convex or concave curves, but some
models in this group are able to have an inflection point and a maximum or minimum.
Only two documents fall under this category. The fit parameters for these documents are
given below:


Vapor Pressure Model, $y = e^{a + b/x + c \ln x}$

File Name                 No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a       b        c
lucretius-on-395.txt      75386          7446           -1.23           Vapor Pressure Model                    9.04    -0.62    -1.01
365 foriegn dishes.txt    27891          1581           -1.22           Vapor Pressure Model                    8.76    -1.37    -1.17

Table 5.5: Documents where the rank & frequency relationship followed Vapor Pressure model
The last family of distributions was the Power Family that involves raising one or more
parameters to the power of the independent variable, or raising the dependent variable to
the power of a given parameter. This family is generally a set of convex or concave
curves with no inflection points or maxima/minima. Nine documents out of thirty one fall
under this category. The fit parameters for these distributions are given below:




Modified Hoerl Model, $y = a\,b^{1/x}\,x^{c}$

File Name                  No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a          b       c
mill-subjection-217.txt    45240          4885           -1.33           Modified Hoerl Model                    7004.18    0.42    -1.06
librarys.txt               37498          2809           -1.00           Modified Hoerl Model                    4769.28    0.58    -1.05

Table 5.7: Documents where the rank & frequency relationship followed Modified Hoerl model


Hoerl Model, $y = a\,b^{x}\,x^{c}$

File Name                       No. of Words   Unique words   Zipf's Coeff.   Relationship between Rank & Frequency   a              b       c
aristotle-meteorology-80.txt    43470          3186           -1.30           Hoerl Model                             3845.7         0.99    -0.80
barrie-peter-277.txt            [illegible]    4788           [illegible]     Hoerl Model                             2252.21        0.98    -0.47
twain-tom-40.txt                24486          2455           -1.24           Hoerl Model                             [illegible]    0.98    -0.57
hiroshima nagasaki.txt          25341          3448           -1.24           Hoerl Model                             2199.15        1.00    -0.95
eidgaah.txt                     4951           1497           -0.82           Hoerl Model                             168.59         0.98    -0.37
urdu.txt                        4035           424            -0.81           Hoerl Model                             151.06         0.99    -0.51

Table 5.6: Documents where the rank & frequency relationship followed Hoerl model
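To make the fitted parameters in the preceding tables easier to interpret, the model families can be written as plain functions of the rank x. The functional forms below follow the conventions commonly used by curve-fitting packages for these model names; since the printed equations in this chapter are partly illegible, treating them as the exact forms used here is an assumption. The function names are illustrative.

```python
import math


def bleasdale(x, a, b, c):
    """Yield-density family: y = (a + b*x) ** (-1/c)."""
    return (a + b * x) ** (-1.0 / c)


def harris(x, a, b, c):
    """Yield-density family: y = 1 / (a + b * x**c)."""
    return 1.0 / (a + b * x ** c)


def weibull(x, a, b, c, d):
    """Sigmoidal family: y = a - b * exp(-c * x**d)."""
    return a - b * math.exp(-c * x ** d)


def mmf(x, a, b, c, d):
    """Sigmoidal family (MMF): y = (a*b + c * x**d) / (b + x**d)."""
    return (a * b + c * x ** d) / (b + x ** d)


def vapor_pressure(x, a, b, c):
    """Exponential family: y = exp(a + b/x + c * ln x)."""
    return math.exp(a + b / x + c * math.log(x))


def hoerl(x, a, b, c):
    """Power family: y = a * b**x * x**c."""
    return a * b ** x * x ** c


def modified_hoerl(x, a, b, c):
    """Power family: y = a * b**(1/x) * x**c."""
    return a * b ** (1.0 / x) * x ** c


def power(x, a, b):
    """Power family: y = a * x**b."""
    return a * x ** b


if __name__ == "__main__":
    # Predicted frequency of the top-ranked word of urdu.txt under the Hoerl fit.
    print(hoerl(1, 151.06, 0.99, -0.51))
```

Evaluating a fitted model at successive ranks gives the predicted frequency curve that can be compared with the observed rank-frequency data.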
There was one document that adhered to the Power Fit of the form $y = ax^{b}$. In model
fitting there is a set of rules, but not a hard and fast rule, as there is no model which is
the only best model. The value of a model lies in the efficacy with which it performs the
task for which it has been constructed. Unfortunately, many functions in real world
situations are nonlinear in parameters. Some nonlinear functions can be linearized by
transforming the independent and/or dependent variables. But we often encounter
functions that cannot be linearized, and then the problem of estimating the nonlinear
parameters arises. This paper discusses the approaches followed in nonlinear curve fitting.
The Zipf-Mandelbrot approach is based on the shifted power distribution. This distribution
can be linearized and this is the way one finds the Zipf's coefficient.
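The linearization mentioned above can be sketched as follows: taking logarithms of $f(r) = K (r + c)^{-\theta}$ gives $\ln f = \ln K - \theta \ln(r + c)$, so for a fixed shift c the slope of an ordinary least-squares line through the points $(\ln(r + c), \ln f)$ estimates the coefficient. The code below is a minimal illustration in pure Python. The function names, the default c = 0 (plain Zipf's law) and the synthetic test data are assumptions, and it is not the fitting procedure used for the tables in this chapter.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return word frequencies sorted from most to least frequent."""
    return sorted(Counter(words).values(), reverse=True)


def zipf_slope(frequencies, c=0.0):
    """Least-squares slope of ln(frequency) against ln(rank + c).

    With c = 0 this is the classical log-log fit of Zipf's law;
    a non-zero c gives the linearized shifted power (Zipf-Mandelbrot) form.
    """
    xs = [math.log(rank + c) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic frequencies that follow f(r) = 1000 / r exactly.
    ideal = [1000.0 / r for r in range(1, 51)]
    print(zipf_slope(ideal))
```

A slope near -1 corresponds to the classical form of Zipf's law.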

The aim of curve fitting in this section was to highlight the similar nature of documents.
We have been partially successful in doing this. We could classify the documents in
groups that show similar fits. This implies that these are the documents that are similar as
far as the distribution of rank and frequency are concerned. It is clearly visible with
respect to the Zipf's coefficients.
Robustness of Zipf's Law


This section is devoted to the second objective, namely whether Zipf's law can be
applied in these literatures.


According to Wikipedia, "Robustness is the quality of being able to withstand stresses,
pressures, or changes in procedure or circumstance. A system, organism or design may be
said to be "robust" if it is capable of coping well with variations (sometimes
unpredictable variations) in its operating environment with minimal damage, alteration or
loss of functionality". In statistical terms, a robust statistical test is one that performs well
even if its assumptions are violated by the true model from which the data were
generated.


Kawamura & Hatano (2002) introduced a simple and generic model that reproduces
Zipf's law. They used a logarithmic scale to address the time evolution of the model as a
random walk and explained how the model reproduces Zipf's law. The explanation shows
that the behavior of the model is very robust and universal. According to Knudsen
(2001), "Zipf's law for cities is one of the most conspicuous and robust empirical facts in
the social sciences". According to Marsili et al. (1998), "Zipf half a century ago, found
that city sizes obey an astonishingly simple distribution law, which is attributed to the
more generic least effort principle of human behavior.... While individuals interact, they
make a compromise of preference. Somehow the ensuing compromise results in a robust
statistical distribution, Zipf's law". According to Levitin (2003), "One may conclude
that the ubiquitous appearance of Zipf's law is based on two independent effects. The first
is the fact that very general transition probabilities lead to Zipf's law. The second reason
why Zipf's law is found so often is probably based on the ranking procedure, which
makes Zipf structures empirically observable because they are robust under its
application". Ferrer-i-Cancho & Solé (2001) commented that Zipf's law has been a
popular achievement of quantitative linguistics. Zipf's law appears to be robust. Many
models of syntactic communication assume this law. It is an obvious ingredient for any
theory of language evolution. A complete theory of language requires a theoretical
understanding of its implicit statistical regularities.
We had taken 31 documents spanning time, typology and language. We had chosen a
sample that is quite varied. It contains some documents that are probable outliers:
outliers in the sense that they are not defined as documents. They are basically a
collection of words arranged in some meaningful order. For example, in the English
German Business dictionary the words are not bound in any consequential manner
but are there because of the alphabet they begin with. Similar is the case of the
Sanskrit text, where the words pertain to the synonyms of the name of lord Vishnu. Let us
revisit how Zipf's law performed in these documents.


File Name                            No. of Words   Zipf's Coefficient
franklin-autobiography-244.txt       68157          -1.37
wollstonecraft-maria-196.txt         45874          -1.37
jefferson-autobiography-73.txt       40648          -1.37
augustine-confessions-276.txt        176014         -1.35
chaucer-canterbury-102.txt           99403          -1.35
Arabian nights entertainments.txt    90768          -1.33
mill-subjection-217.txt              45240          -1.33
freud-young-763.txt                  72133          -1.3
aristotle-meteorology-80.txt         43470          -1.3
locke-concerning-111.txt             53786          -1.29
barrie-peter-277.txt                 47885          -1.29
berkeley-treatise-177.txt            36342          -1.29
bunyan-pilgrims-304.txt              57122          -1.28
anonymous-beowulf-543.txt            27129          -1.27
dickens-christmas-125.txt            21818          -1.25
hiroshima nagasaki.txt               25341          -1.24
twain-tom-40.txt                     24486          -1.24
lucretius-on-395.txt                 75386          -1.23
keats-endymion-484.txt               31962          -1.23
365 foriegn dishes.txt               27891          -1.22
The arctic queen.txt                 16703          -1.21
shakespeare-hamlet-25.txt            33098          -1.2
shakespeare-romeo-48.txt             26784          -1.15
librarys.txt                         37498          -1
aladdin ger.txt                      17686          -0.92
aladdin eng.txt                      5319           -0.92
eidgaah.txt                          4951           -0.82
urdu.txt                             4035           -0.81
Eng-ger-busDictionary.txt            10089          -0.66
Eng-ger-busDictionary.txt            9107           -0.63
sanskritwork.txt                     1411           -0.54

Table 5.8: Zipf's coefficients of various documents


The mean Zipf's coefficient was found to be -1.153 with a standard deviation of 0.2388.
As envisaged, Zipf's law is pretty robust across the documents, but for the ones
mentioned above. The reasons for this are elaborated in the following discussion.
For this we have to go into the genesis of the typology of these documents. Let us begin
our discussion with the dictionary.

A dictionary is defined as "a reference book containing an alphabetical list of words, with
information given for each word, usually including meaning, pronunciation, and
etymology". In other words, it is a book listing the words of a language with translations
into another language (as is the case here: an English to German dictionary). In bilingual
dictionaries, each entry has translations of words in another language. For example, in a
German-English dictionary, the entry 'kosten' has a corresponding English word, 'cost',
and the entry 'gesetz' has a corresponding English word, 'law'. In many languages,
words are grouped together according to their true or normal origin ("root"), and these
roots are arranged alphabetically. So now we can say that a dictionary is a large corpus of
words collected in a certain manner. But the principle of least effort is not observed here.
This is the precise reason why Zipf's law cannot perform well in this type of corpus.

The Sanskrit text taken here is also a different type of corpus. This document is about
synonyms of the names of lord Vishnu. Hence the repetition of words was not expected.
The repetitions which are present are basically the connecting words or explanatory
words. So there was no question of the principle of least effort being present here. This
is the reason for the bad performance of Zipf's law in this document.


Figure 5.1: Box plot of Zipf's coefficients of various documents
The above figure shows a box plot of the Zipf's coefficients. Box plots are summary plots
based on the median, quartiles, and extreme values. The box represents the inter-quartile
range, which contains the middle 50% of values. The whiskers are lines that extend from
the box to the highest and lowest values, excluding outliers. A line across the box
indicates the median. The three documents mentioned above are outliers as far as Zipf's
coefficients are concerned. But for these documents, we can conclude that Zipf's law is
robust across literatures.
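The box-plot reading above can be checked numerically from the 31 coefficients listed in Table 5.8. The sketch below assumes the common box-plot convention of Tukey's hinges with fences at 1.5 times the hinge spread; the source does not state which quartile rule its plotting software applied, so this is one plausible reconstruction. The function name `tukey_outliers` is illustrative.

```python
import statistics

# Zipf's coefficients of the 31 documents, as listed in Table 5.8.
ZIPF_COEFFICIENTS = [
    -1.37, -1.37, -1.37, -1.35, -1.35, -1.33, -1.33, -1.3, -1.3,
    -1.29, -1.29, -1.29, -1.28, -1.27, -1.25, -1.24, -1.24, -1.23,
    -1.23, -1.22, -1.21, -1.2, -1.15, -1.0, -0.92, -0.92, -0.82,
    -0.81, -0.66, -0.63, -0.54,
]


def tukey_outliers(values, k=1.5):
    """Return the values lying beyond k hinge-spreads from Tukey's hinges."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2                  # both halves include the median when n is odd
    lower_hinge = statistics.median(xs[:half])
    upper_hinge = statistics.median(xs[n - half:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - k * spread
    high_fence = upper_hinge + k * spread
    return [x for x in xs if x < low_fence or x > high_fence]


if __name__ == "__main__":
    print("mean:", round(statistics.mean(ZIPF_COEFFICIENTS), 4))
    print("std. deviation:", round(statistics.stdev(ZIPF_COEFFICIENTS), 4))
    print("outliers:", tukey_outliers(ZIPF_COEFFICIENTS))
```

Under this convention the values flagged are the coefficients of the two dictionary files and the Sanskrit text, which agrees with the three outliers named in the text.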



Inter-literature comparison of the applicability of Zipf's law

This section pertains to the third objective, that is, to compare the applicability of Zipf's
law among the different documents selected. For this we have employed cluster analysis to
find clusters of documents that are "similar".

The literal meaning of clustering is to gather, to congregate or draw together. In terms of
data management, clustering means dividing the data in such a way that similar data points
come together. The objective of clustering is to form groups that are heterogeneous
between themselves but homogeneous within. Clustering is thus a method to divide a
database into clusters that can be used for classification purposes. However, classification
segments the data into groups that are already defined. Clustering facilitates segmentation
of the data into groups that are not previously defined.

This study is thus intended to make an in-depth study of various document parameters
through cluster analysis, to devise a tool to formulate group(s) of documents that are
'different'. The main purpose of this study is to build a sound logic and deduce how the
documents can be classified into different heterogeneous groups that are homogeneous
within, and to try to find reasons that make the group(s) 'different'.

Clustering 

According to Berry and Linoff (2001), “Cluster Analysis is an important human activity.
Early in childhood, one learns to distinguish between cats and dogs, or between animals
and plants, by continuously improving subconscious clustering schemes”. With the help
of clustering one can segment the data into small similar regions and thus comment on
the overall distribution patterns of the data. Clustering is done on the basis of a similarity
measure over attributes/variables, so that data points in one cluster are more similar to
one another (homogeneous) and are less similar or dissimilar to the data points of other
clusters (heterogeneous) (Anderberg, 1973). Clustering methods are discussed in various
text books (Hartigan, 1975; Jain & Dubes, 1988). These methods have various techniques
and can be performed in many ways. There are a few methods that start by considering all
records to be part of one big cluster and then split them into two or more smaller clusters.
On the other hand, there are methods that start with each record taken as a cluster, and
iteratively combine them to form clusters. The former methods are called Divisive methods
and the latter Agglomerative methods (Romesburg, 1984; Kaufman and Rousseeuw, 1990).
Another method is to group the two closest objects into a single cluster, so that the number
of objects is reduced to n - 1. Then the next two closest objects are grouped, and the process
continues till all n objects are covered under a single cluster. Here the clustering is done
step by step and the method is known as “hierarchical clustering” (Romesburg, 1984).
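The step-by-step merging described above can be sketched as follows. This is a minimal illustration rather than the procedure used in this study: the single-linkage rule (distance between two clusters taken as the distance between their closest members) and the sample points are assumptions made for the example.

```python
import math
from itertools import combinations

def single_linkage(a, b, points):
    """Distance between clusters a and b: closest pair of members (assumed rule)."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns the list of merged clusters in the order they were formed;
    n objects always produce n - 1 merges.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges

# Hypothetical one-dimensional objects, for illustration only.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
merges = agglomerate(points)
```

For these four points the two nearest objects are merged first, then the next nearest pair, and the final step joins everything into a single cluster.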

For applying clustering techniques, data is arranged in two matrices, called the “data matrix”
and the “dissimilarity matrix”. While the data matrix is a representation of n objects (such as
students) with m attributes (such as gender, program, region, age, social status etc.), the
dissimilarity matrix is a collection of distances between each pair of objects. The data matrix
can be shown as follows:

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1m} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nm}
\end{bmatrix}
\]

The dissimilarity matrix can be shown as follows:

\[
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]

If the distances in the matrix are near zero, then the objects are highly similar or “near”
to each other.

Calculation of Distances 

According to Han and Kamber (2001), one could come across various types of variables
while clustering the data. All the variables that we have in this section are interval-scaled
variables. These variables are continuous measurements on a roughly linear scale, e.g.
weather temperature, weight and height. One can find the distances between the objects.
This is called the Euclidean distance (d) and is defined as

\[
d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{im} - x_{jm})^2}
\]

where i and j are two m-dimensional data objects represented by
$(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im})$ and $(x_{j1}, x_{j2}, x_{j3}, \ldots, x_{jm})$.
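As an illustration of the Euclidean distance and of the lower-triangular dissimilarity matrix built from it, the following Python sketch may help. The function names and the sample objects are hypothetical and serve only the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data):
    """Lower-triangular matrix of pairwise distances.

    Row i holds d(i, 0), ..., d(i, i); the diagonal entries are 0.
    """
    return [
        [euclidean(data[i], data[j]) for j in range(i + 1)]
        for i in range(len(data))
    ]

# Hypothetical objects with m = 2 attributes each (illustrative values only).
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = dissimilarity_matrix(objects)
```

Only the lower triangle is stored because d(i, j) = d(j, i), which matches the layout of the dissimilarity matrix shown earlier.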

To exhibit how clustering methods help in defining the distance between documents, a
dataset of 31 documents was taken and arranged on the basis of information on
parameters like Zipf's coefficient, number of sentences, syllables per word and words per
sentence. On the basis of this we were able to get the following clusters of documents.
The final cluster centres and the number of cases in each cluster are as follows.



Final Cluster Centers

Cluster                  1      2      3      4      5      6      7      8      9     10
Sentences             3428   4141   7139   2451    494  10610  11160  13578  22974   5299
Syllables per word    1.62   1.57   1.55   1.63   2.12   1.60   1.49   1.54   1.53   1.93
Words per sentence    6.53   7.15   8.07   6.81   7.62   7.83   6.46   7.32   7.66   6.70
Zipf's coefficient   -1.16  -1.23  -1.31  -1.21  -0.77  -1.28  -1.30  -1.35  -1.35  -1.14

Cluster  Cases  Documents
1        4      mill-subjection-217.txt, atomic bomb hiroshima nagasaki.txt,
                365 foreign dishes.txt, The arctic queen.txt
2        4      chaucer-canterbury-102.txt, Arabian nights entertainments.txt,
                locke-concerning-111.txt, Eng-ger-busDictionary.txt
3        3      franklin-autobiography-244.txt, barrie-peter-277.txt, keats-endymion-484.txt
4        1      freud-young-763.txt
5        4      dickens-christmas-125.txt, lucretius-on-395.txt, shakespeare-hamlet-25.txt,
                urdu.txt
6        2      aristotle-meteorology-80.txt, librarys.txt
7        1      sanskritwork.txt
8        1      berkeley-treatise-177.txt
9        1      aladdin ger.txt
10       10     jefferson-autobiography-73.txt, wollstonecraft-maria-196.txt,
                augustine-confessions-276.txt, bunyan-pilgrims-304.txt,
                anonymous-beowulf-543.txt, twain-tom-40.txt, shakespeare-romeo-48.txt,
                aladdin eng.txt, eidgaah.txt, Eng-ger-busDictionary.txt

Table 5.9: Results of Cluster Analysis of Documents
It is quite evident from the analysis that the clustering technique was not quite successful in
segregating the documents into heterogeneous groups that are homogeneous within. So we
decided to include more parameters. These parameters were least effort %, contain %
and the Flesch Index; we also deleted sentences as a variable. With these 6 parameters we
again proceeded with the cluster analysis and obtained the following results.


Final Cluster Centers

Cluster                  1      2      3      4      5      6      7      8      9     10
Contain %            16.75  55.04  36.31  44.87  34.30  61.31  18.56   2.69  53.76  62.98
Flesch Index         -9.95  69.52  60.77  67.35  28.74  59.31 -82.83 -79.97  50.09  74.23
Least Effort %       13.90  10.93   9.87  17.22  30.24   7.93  15.13   7.80  11.21   6.97
Syllables             2.54   1.54   1.65   1.57   1.99   1.64   3.41   3.33   1.76   1.49
Words per sentence    1.75   7.19   6.32   6.63   9.80   8.98   1.57   4.99   8.14   7.04
Zipf's coefficient   -0.66  -1.19  -0.87  -1.24  -0.82  -1.32  -0.63  -0.54  -1.27  -1.28

Table 5.10: Final Cluster Centers of Document Parameters






The meaning of the above table is explained here. Suppose we take the fourth cluster. In the
fourth cluster, there will be documents whose mean contain percentage is 44.87%.
They will have a mean Flesch Readability Index of 67.35. The documents would be written
by exerting a mean effort of 17.22%. These documents would have on average 1.57
syllables per word and 6.63 words per sentence, and the Zipf's coefficient of these
documents would be around -1.24.

The clustering algorithm works like this. It finds the distances of the documents
from these cluster centres on these parameters and classifies the documents accordingly. A
document falls in a particular cluster if it is within an acceptable distance from that
cluster centre. We have used the SPSS software for this analysis. It finds the cluster
membership of each document on the basis of these distances and also calculates a
cumulative distance, which indicates the strength of membership of a particular document.
If a document is far from the cluster centre, then that document has a weak
membership of that cluster, and vice versa.
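The assignment step described above can be sketched as a nearest-centre lookup. This is only an illustration of the idea, not the SPSS procedure itself: the three centres are taken from Table 5.10 (clusters 2, 4 and 8), the document vector is hypothetical, and the use of the raw, unstandardised Euclidean distance is an assumption.

```python
import math

# Three of the ten final cluster centres from Table 5.10, in the order:
# contain %, Flesch index, least effort %, syllables,
# words per sentence, Zipf's coefficient.
CENTRES = {
    2: (55.04, 69.52, 10.93, 1.54, 7.19, -1.19),
    4: (44.87, 67.35, 17.22, 1.57, 6.63, -1.24),
    8: (2.69, -79.97, 7.80, 3.33, 4.99, -0.54),
}

def assign(document, centres):
    """Return (cluster label, distance) of the centre nearest to the document.

    Assumption: plain Euclidean distance on the raw parameter values.
    """
    distances = {
        label: math.dist(document, centre) for label, centre in centres.items()
    }
    nearest = min(distances, key=distances.get)
    return nearest, distances[nearest]

# Hypothetical document parameters, for illustration only.
document = (46.0, 66.0, 16.5, 1.58, 6.7, -1.22)
cluster, distance = assign(document, CENTRES)
```

A small distance to the chosen centre corresponds to a strong membership, and a large one to a weak membership. Because the parameters are on different scales, contain % and the Flesch index dominate the raw distance in this sketch.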

We obtained the clusters of the documents in the following manner.

Clustering has given us a meaningful segregation, as it has classified the documents on the
basis of their characteristics. The Zipf's coefficient is a testimony of this
segregation. This classification also substantiates the results obtained in previous
sections. It also highlights the comparability of documents as far as the applicability of
Zipf's law is concerned. The motive of this section was precisely that.




Let us discuss the above table from the point of view of the applicability of Zipf's law. In
cluster 1 we have only one document, the English German Business Dictionary
(German words). This document has a Zipf's coefficient of -0.66 and is a different type of
corpus, which we have discussed earlier. So it ought to be in a unique cluster. Now let us
come to the second cluster. It has six documents, namely shakespeare-hamlet.txt,
shakespeare-romeo.txt, lucretius-on-395.txt, barrie-peter-277.txt, chaucer-canterbury.txt
and aladdin eng.txt. The Zipf's coefficients for the first five documents range from
-1.15 to -1.35. The only exception in this cluster is aladdin eng.txt, which has a Zipf's
coefficient of -0.92, but the distance of this document from the cluster centre is
the highest. This means it is tending towards the third cluster, which has two documents,
aladdin ger.txt and urdu.txt, with Zipf's coefficients of -0.92 and -0.81.


Cluster 4 has four documents: keats-endymion-484.txt, dickens-christmas-125.txt,
anonymous-beowulf-543.txt and the arctic queen.txt. All these documents have a
Zipf's coefficient of about -1.20, so there is uniformity within this cluster.
Cluster 5 has one document, namely eidgaah.txt, with a Zipf's coefficient of -0.82. We
have earlier discussed the peculiar and different nature of this document. Cluster 6
has four documents, namely aristotle-meteorology-80.txt, Arabian nights
entertainments.txt, locke-concerning.txt and franklin-autobiography-244.txt. These
documents have a Zipf's coefficient of around -1.30.


Clusters 7 and 8 again pertain to peculiar documents, namely eng-ger bus Dictionary.txt
(German words) and sanskritwork.txt respectively. These documents have Zipf's
coefficients of -0.63 and -0.54. This was expected in view of the nature of these
documents. Cluster 9 has six documents, namely jefferson-autobiography-73.txt,
wollstonecraft-maria-196.txt, mill-subjection-217.txt, hiroshima nagasaki.txt,
berkeley-treatise.txt and librarys.txt. The Zipf's coefficients of these documents are in the
range -1.29 to -1.37. There is one exception, librarys.txt, which has the perfect Zipf's
coefficient of -1. Again we can justify this by saying that this document is very distant
from the cluster centre of cluster 9. Cluster 10 has five documents. All of them have Zipf's
coefficients in the range -1.22 to -1.35, which is again very uniform. The documents
which belong to this cluster are bunyan-pilgrims-304.txt, freud-young-763.txt, 365
foreign dishes.txt, augustine-confessions.txt and twain-tom-40.txt.

What we have established in this section is that Zipf's law applies in a less robust
manner to those documents which have peculiar parameters. The Zipf's
coefficients depend on the nature of the parameters that we have defined in the preceding
discussion. We can also say that these parameters are successful in comparing the
documents vis-a-vis the applicability of Zipf's law.

Summary & Conclusions

Zipf formulated a law in 1930 that says the frequency count (number of occurrences) of a
word in any text is inversely proportional to the rank of that word. The frequency count of a
word is the number of occurrences of that word in the text. The words are then arranged
in decreasing order of frequency so that the most frequent word gets the highest rank.
Zipf, in his first thesis, “Relative Frequency as a Determinant of Phonetic Change”, wrote,
“Observing the speech of many hundreds of millions of people, we have demonstrated, in
part actually, in part by induction, that the conspicuousness or intensity of any element of
language is inversely proportionate to its frequency.” Zipf's Law approximates the
relationship between rank and frequency of any text. The text should consist of at least
5000 words in order for the product r × f to be reasonably constant. Zipf attributed this
law to the “Principle of Least Effort”. Zipf's distribution plays a central
role in the modelling of human activities, particularly of the variables studied in
bibliometrics and scientometrics. It is called “one of the most puzzling phenomena in
bibliometrics”. Zipf (1949), in his work “Human Behavior and the Principle of Least
Effort”, viewed language as a “tool” that is shaped by its “jobs” in human society. Other
works of Zipf were “Selected Studies of the Principle of Relative Frequency in
Language”, which was published in 1932, “The Psycho-Biology of Language”, which was
published in 1935, and “National Unity and Disunity: The Nation as a Bio-Social
Organism”, which was published in 1941. In the study “The Psycho-Biology of Language”,
Zipf's goal was to put language study on a par with the exact sciences, by use of “statistical
techniques”. It was an attempt to prove that the key to the explanation of all synchronic
and diachronic language phenomena has been found in a statistically estimated tendency
to maintain equilibrium between size and frequency.

There has been a debate as to whether Zipf's law follows a power law, a “stretched
exponential” (Weibull), a “log-normal” or a “Yule” distribution. There are two Zipf
“laws”: the rank-frequency one and the frequency-count one. There are many schools
of thought about Zipf's Law. Some of them are as follows:

• Zipf's law can be derived from stochastic processes
• Zipf's law is the Bose-Einstein form of the classical occupancy model
• Zipf's law is a Negative Binomial Model
• Zipf's law is a Logarithmic Series distribution
• Many of the classical occupancy models can be manipulated to yield hyperbolic
  distributions
• Zipf's law is a Beta function
• Zipf's law is a cumulative advantage distribution
• Zipf's law is an information-theoretic approach to study the statistical structure
• Work in the field of quantitative linguistics is dependent on Zipf's law
• Laplace's law of succession is shown to be the 'Zipfian' frequency analogue of
  the Bradford Law
• The Discrete Gaussian Exponential (DGX), as defined by the proposed PDF, reduces
  to Zipf's law as p → +∞, etc.

Zipf did show that an astonishingly wide range of phenomena exhibited distributional
behavior that could be approximated by his 'Law'. But many people commented that it
applies to the distribution of only social characteristics, and that the relative frequency of
occurrence of the descriptors of science/technology did not conform to the theoretical
distribution of Zipf. They argued that too much emphasis has been placed on this result
(Zipf's law). They commented that Zipf's law is theoretically elegant, but it provides
additional parameters. One major finding was that all the empirical distributions can be
divided into two types. These are the Gaussian type (G-type) and the Zipfian type (Z-type).
The Z-type distributions have no moments whatever. The Zipf distribution of a document
employs the frequencies of the words forming that particular document, so contextual
similarity can be assessed on the basis of the numerical encoding produced by the particular
distribution.

Zipf's Law has a plethora of applications in modern times. Many researchers have
applied Zipf's law to city populations. It is used in modelling urban growth patterns and
the Self-Organizing Economy. A model of large-scale city formation has been developed
using Zipf's law. Others developed an intermittency model for urban development and
modelled interacting individuals using Zipf's law.

It is claimed that Zipf's law governs many features of the Internet. Zipf's law has
implications for the search strategies used in P2P networks. Web requests from a fixed
user community are distributed according to Zipf's law. Zipf's law is used in Web access
statistics and in Internet traffic, such as caching relays for the World Wide Web. A model
based on Zipf's law has been proposed for software reliability analysis.

Zipf's law has applications in finance and business also. Many empirical size distributions
in economics and elsewhere exhibit power-law behavior in the upper tail. Zipf plots and
the size distribution of firms are related. The Zipf distribution thus characterizes firm sizes.
The Zipf distribution forms a microeconomic model in which individual agents interact to
form productive teams.

Zipf's law is applied in many other areas, like ecological systems, genomic data,
earthquakes and clinical diagnosis. Zipf's law is applied in ascertaining the importance of
genes for cancer classification using microarray data. It is found that an inverse power
relationship exists between the rank order of diagnoses and the frequency of the appearance
of these diagnoses (Zipf's Law). There are many more examples, like Zipf's law in
percolation, in the immune system, in the liquid-gas phase transition of nuclei and in a
psychiatric ward.

There have been many applications of the law in natural languages, like English and Chinese,
as well as in the Voynich manuscript and random texts. The universality of Zipf's law and
the differences between all languages on Earth tempted researchers to think that its
explanation has something to do with language. Zipf's law can be rooted in a
language-structuring process of coding, which adds the redundancy necessary for language
understanding. Zipf's Law provides a distributional foundation for models of the language
learner's exposure to segments, words and constructs, and permits evaluation of learning
models.

There have been many studies of the effect of different corpora. Some of these
examples include legal texts, the Brown corpus of 1 million words of American English,
large corpora in two languages, English and Mandarin, a Chinese corpus, Eldridge's
distribution of word usage in four American newspaper articles, Brugmann's study of
four plays in Plautine Latin, noun frequency in Macaulay's essay on Bacon, a Russian
corpus, Antony and Cleopatra, Richard III, novels such as 'Wuthering Heights' by Emily
Bronte and 'Sense and Sensibility' by Jane Austen, the Voynich manuscript, Hindi and Urdu
texts, a Greek corpus, a French corpus, technical writing, spoken American (verbatim) and
samples of adult speech. It is commented that Zipf's law is a reflection of a specific
property of the organization of human memory, which usually operates with more
frequent language units in all cases of the spontaneous use of speech.

The present work was carried out with the objectives of finding the interrelationships between
the rank and the frequency of a word in selected literatures; testing whether Zipf's law
can be applied in these literatures; making an inter-literature comparison of the applicability
of Zipf's law; and attempting mathematical modelling. A few hypotheses were assumed to be
true, such as that the principle of least effort is a universal phenomenon, that all writers
would follow an economy in the use of words irrespective of the language concerned, and
that the rank-frequency distribution of words would be similar in all languages.

For the inter-literature comparison of the applicability of Zipf's law, the study selected
a few sets of texts from diverse literatures. 31 sets of texts were selected from computer
science literature, Hindi, English, German, Urdu, Sanskrit, a technical subject like
Library Science, and a dictionary. Other sources include public domain electronic texts
(e-texts) in the areas of American and English literature as well as Western philosophy.
These were “classic” texts that have stood the test of time. They also encompass a huge
time period, from as far back as 400 BC to the present. Also taken were popular e-texts like
“365 Foreign Dishes”, “The Arabian Nights Entertainments”, “The Arctic Queen” and
“The Atomic Bombings of Hiroshima and Nagasaki”.

The software used in this work for calculating the word frequencies from the texts is
“TextSTAT”. All unique words were ranked in decreasing order of their frequency of
occurrence. Different ranks were assigned to each of them
according to Zipf's approach of random ranks. Microsoft Excel has been used
extensively to “sort” the data in the first place, and the “advanced filter” feature of Excel
was used to filter out the unique frequencies. Once the Zipfian data had been obtained for
the various files, regression analysis and curve fitting were done on this data. A linear fit
was done in order to find the applicability of the Zipf-Mandelbrot law. We have used various
statistical packages like SPSS, Minitab and Curve Expert to carry out these analyses on
the selected texts.
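The rank-frequency procedure and the linear fit can be sketched in Python as follows. This is a simplified illustration and not the TextSTAT/Excel/SPSS workflow used in this work: it ranks every distinct word (ties are ordered arbitrarily), fits an ordinary least-squares line to log frequency against log rank, and uses a synthetic corpus whose word names and frequencies are hypothetical.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, frequency) pairs, most frequent word first."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_coefficient(rank_freq):
    """Least-squares slope and intercept of log10(frequency) on log10(rank).

    The slope is the Zipf's coefficient; a value near -1 indicates that
    rank x frequency is roughly constant.
    """
    xs = [math.log10(r) for r, _ in rank_freq]
    ys = [math.log10(f) for _, f in rank_freq]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two ranks for a linear fit")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic corpus in which the word of rank r occurs 1200 / r times,
# i.e. an ideal Zipfian text (hypothetical data, for illustration only).
corpus = []
for rank in range(1, 7):
    corpus.extend([f"word{rank}"] * (1200 // rank))

slope, intercept = zipf_coefficient(rank_frequency(corpus))
```

For this ideal corpus the fitted slope is very close to -1. Real texts give coefficients that deviate from -1, as reported in the findings below.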

Major Findings 

The major findings of this work are: 

1. Zipf's law is applicable to random text in the English language from Computer
Science literature. Random texts do follow Zipf's law; however, the exponent
varies from text to text. The method of random rank performs worse than the
maximal rank method and the tied rank method proposed by authors.

2. The distribution of words according to their length and the hits they are able to
generate on the popular search engine “Google” follows Zipf's Law. It is a Zipf-type
distribution with an exponent not close to unity (in fact, it came out to be -3.51).

3. Zipf's Law is applicable in English Literature (Aladdin and the Wonder Lamp),
and for the Mandelbrot-Zipf law ( $g(r) = a(r + b)^c$ ) the coefficient c in this case
is -0.92.

4. Zipf's Law in German Literature (Aladdin und die Wunderlampe) produced a
coefficient of -0.92.

5. Zipf's Law is not applicable to the English-German Business Dictionary (Mr.
Honey's Small Business Dictionary (English-German)); the coefficient in this case
is -0.66, which is not close to -1.

6. Zipf's Law is applicable in Hindi Literature (Eidgaah by Munshi Premchand); the
coefficient is -0.82.

7. Zipf's Law is applicable in a text from Library Science Literature (“The Library”,
by Andrew Lang). The Zipf's coefficient here is a perfect -1.

8. Zipf's Law is applicable in Urdu Literature (Bisat-e-Hyder by Hyder Zaheer
Ansari Hyder); the coefficient c in this case is -0.81.

9. Zipf's Law is not applicable in this piece of Sanskrit Literature (“Sri Vishnu
Sahasranaamam”). For the Mandelbrot-Zipf law ( $g(r) = a(r + b)^c$ ) the
coefficient c in this text is -0.54.

10. In an attempt to relate the Flesch Readability Index and Zipf's law, we finally found
that 21 documents belong to type V. These documents have different
readability indices, belong to different genres and belong to different time
periods, but have almost similar values for the Zipf's coefficient. This indicates that
readability has little to do with the Zipf's coefficients.

11. We defined least effort % as the ratio (percentage) of the unique words to the total
number of words in the data. It reveals the “effort” that the writer has made in
communicating his ideas: the smaller the percentage, the less the effort of the
writer. We defined contain % as the ratio (percentage) of the sum of rank frequencies of the
Zipfian data of the document to the total number of words in the data. It reveals
the amount by which the Zipfian data is able to capture the document: the higher
the percentage, the better the containment of the document in the Zipfian
data.

12. Factor analysis revealed that factor 1, comprising contain % and the Flesch
readability index, together with the least effort %, explains 92% of the
variance of the data.

13. Multiple regression analysis to predict the dependent variable, Zipf's coefficient,
revealed that contain % alone explains 82.3% of the Zipf's coefficient.

14. The Zipf's coefficient of the documents can be predicted well by the variable
contain % only, and the variable least effort % is not required.

15. Curve fitting was applied with partial success to highlight the similar nature of
documents. We could classify the documents in groups that show similar fits. This
implies that documents that are similar as far as the distribution of
rank and frequency (Zipf's coefficients) is concerned are classifiable with the help
of Zipf's Law.

16. There were three documents out of thirty-one which can be deemed 'outliers'
as far as Zipf's coefficients are concerned. Except for these documents, we can
conclude that Zipf's law is robust across literatures.

17. Application of cluster analysis helped us to go deeper. Clustering has
given us a meaningful segregation, as it has classified the documents on the basis
of their characteristics. The Zipf's coefficient is a testimony of this segregation.

18. Zipf's law applies in a less robust manner to those documents which have
peculiar parameters. The Zipf's coefficients depend on the nature of the parameters.
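The two percentages defined in finding 11 can be illustrated with a short Python sketch. The reading of “Zipfian data” as the set of distinct frequency values (the result of filtering out the unique frequencies) is an assumption made for this example, and the sample sentence is hypothetical.

```python
from collections import Counter

def least_effort_pct(words):
    """Unique words as a percentage of the total number of words."""
    return 100.0 * len(set(words)) / len(words)

def contain_pct(words):
    """Sum of the distinct frequency values as a percentage of total words.

    Assumption: the 'Zipfian data' keeps each frequency value once,
    as obtained by filtering out the unique frequencies.
    """
    unique_freqs = set(Counter(words).values())
    return 100.0 * sum(unique_freqs) / len(words)

# Hypothetical eight-word text, for illustration only.
words = "the cat and the dog and the bird".split()
effort = least_effort_pct(words)   # 5 unique words out of 8
contain = contain_pct(words)       # distinct frequencies 3, 2 and 1 out of 8
```

In this toy text five of the eight words are unique, and the distinct frequencies 3, 2 and 1 add up to 6 of the 8 words.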

Suggestions for Future Research

The present study concentrated on “A comparative study of robustness of Zipf's Law
across literatures”. The study was mainly intended to study the “between” documents
comparison. Keeping this delimitation in view, a number of suggestions can be put
forward for future research in the area:

1. A more comprehensive sample of documents “within” a subject can be taken
to study the inter-literature variability of the applicability of Zipf's law.

2. There is a vast scope for future research on technical subjects such as Library
& Information Science, Management Science and Computer Science.

3. More research is required to study the applicability of Zipf's
law in Indian languages like Sanskrit, Urdu and Hindi.

4. Since India is a country of many regional languages, these applications may
be extended to the regional languages of India like Tamil, Telugu, Awadhi,
Punjabi, Bengali and Oriya.

5. More research can be conducted to study the principle of least effort in the
context of Indian regional languages. This will help in testing the belief about the
richness of these languages.
Appendix 1

Word Frequency distribution of 365 Foreign Dishes

Appendix 4

Word Frequency distribution of The Arabian Nights Entertainments


Word        Frequency
him         694
not         608
this        589
which       586
at          578
for         575
upon        214
where       203
no          199
went        198
out         196
before      191
if          186
two         155
gutenberg   93
should      91
vizier      85

Appendix 6 

Word Frequency distribution of Meteorology by Aristotle

Appendix 8

Word Frequency distribution of “Confessions and Enchiridion” by Saint Augustine

Appendix 9

Word Frequency distribution of “The Pilgrim's Progress” by John Bunyan

Appendix 10

Word Frequency distribution of Peter Pan by James M. Barrie

Appendix 12

Word Frequency distribution of “A Treatise Concerning the Principles of Human Knowledge”
by George Berkeley

Appendix 14

Word Frequency distribution of Operating System - Concepts and Design by Milan Milenkovic

Appendix 15

Word Frequency distribution of “A Christmas Carol” by Charles Dickens

Word Frequency distribution of “Mr. Honey's Small Business Dictionary”
(English-German) by Winfred Honig (English words)

Word Frequency distribution of “Mr. Honey's Small Business Dictionary”
(English-German) by Winfred Honig (German words)

Appendix 19

Word Frequency distribution of The Autobiography of Benjamin Franklin

Appendix 20

Word Frequency distribution of “A Young Girl's Diary”,
Prefaced with a Letter by Sigmund Freud

Appendix 22

Word Frequency distribution of “Endymion: A Poetic Romance” by John Keats

Appendix 23 

Word Frequency distribution of “The Library” by Andrew Lang

Appendix 24 

Word Frequency distribution of “Concerning Civil Government”, Second Essay: an essay
concerning the true original, extent and end of Civil Government, by John Locke, Chapter I

Appendix 25

Word Frequency distribution of “On the Nature of Things” by Titus Lucretius Carus

Appendix 26

Word Frequency distribution of “The Subjection of Women” by John Stuart Mill

Appendix 27

Word Frequency distribution of Sanskrit - “Sri Vishnu Sahasranaamam”

Appendix 28

Word Frequency distribution of “Hamlet” by Shakespeare

Appendix 29

Word Frequency distribution of “Romeo and Juliet” by Shakespeare

Appendix 30 

Word Frequency distribution of “Tom Sawyer, Detective” By Mark Twain 
from “The Writings of Mark Twain, Volume XX”

Appendix 31

Word Frequency distribution of “The Wrongs of Woman” by Mary Wollstonecraft